[aprssig] distributed findu possible ?

Steve Dimse steve at dimse.com
Sun Aug 10 11:34:57 EDT 2008


On Aug 10, 2008, at 9:02 AM, Matti Aarnio wrote:
>
> One of the reasons that people have no idea of what findU can do is its
> "user interface". Indeed you have supplied only a backend of things, no
> frontend at all,

Frontend and backend have specific meanings in dynamic web systems:
typically the backend is the database and the frontend is the web
server. In larger systems these are often on different physical
machines. Under that standard definition there is indeed a frontend
on findU. And I specifically disallow anyone from using findU as a
backend.

I take your meaning to be that findU does not have a user-friendly way
to generate the URLs. That is very intentional. findU is a worldwide
system; I do not have the resources to localize a user interface into
different languages. On the other hand, it is relatively simple to
create forms that generate the URLs. It was and is my hope to get more
people involved in creating APRS internet resources by letting them
create their own form pages that generate the findU URLs. A handful of
people have, in a few languages, and I link the ones I know of on my
front page. I'd still like to see more.
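
The core of such a page really is just URL construction. Here is a
minimal sketch in Python of that half of the job (the find.cgi?call=
pattern is the familiar findU style, but treat the exact CGI name and
parameters as assumptions and check findU's CGI list for the real
ones):

    from urllib.parse import urlencode

    def findu_url(callsign, cgi="find.cgi", **params):
        """Build a findU URL of the classic find.cgi?call=CALL form."""
        query = urlencode({"call": callsign, **params})
        return "http://www.findu.com/cgi-bin/" + cgi + "?" + query

    print(findu_url("K4HG"))
    # http://www.findu.com/cgi-bin/find.cgi?call=K4HG

A form page just wraps this in an HTML form whose fields feed the
parameters.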

> and on some details, like how long data is retained, the
> information is not given anywhere that I can spot

That's because I need to vary it from time to time, as disk space
wanes and the APRS IS traffic rises. I can't even keep up with the
info already on there; I just noticed the front page says the database
is 58 GB in size, and that is really old info ;-)
>
> It is much, much easier to point aprs.fi's map at the general area
> of interest, and then look at what happens around there.

Of course it is. I wish I had time to program a full gmap
implementation on findU. My point, though, is that you cannot call
something a distributed findU if it only has the easy features of
findU. The database aspect of the aprs.fi front page is trivial. In
fact, I was doing it in memory, without a database backend, 12 years
ago as part of APRServ, the original APRS hub program.

I'm not saying aprs.fi is not useful, or wrong, or anything negative
of any sort. I'm simply saying that you cannot talk about something as
a findU analog if it only cherry-picks the easy stuff.

> In the end the raw data may not live in the system for very long, but
> those end-product views are longer-term data.
> Like:
>
>  http://aprs.fi/weather/OH2KXH/year
>  http://aprs.fi/telemetry/OH2RDK-5/month

Talk about hard-to-find info: there is nothing on the home page that
indicates this is available on aprs.fi. At least findU has a list of
available CGIs and their parameters. This is better than I thought was
available there, though I still don't see a way to get anything other
than the handful of preset views. Is there a way to show a detailed
plot of high-resolution data for an arbitrary time?

>> How are you going to show month+ long tracks?
>
> That all means that:
>  - Data is kept in a persistent database (no RAM-only nodes)
>  - Its insertion must be cheap (as in "quick")
>  - Its retrieval must be cheap (which may make the insertion less
>    cheap...)
>
> Disk space keeps growing, yet disks can handle only so many IO
> operations per second, because moving IO heads along the disk surface
> and spinning the disks themselves take roughly the same time now as
> they did 10 years ago. Thus a single terabyte disk is no _faster_ at
> doing IOs than a single 10 GB disk.

Assuming all other parameters are identical, that is true. My first
hard drive was a 16 MB (yes, megabyte) drive, the size of a shoebox,
that I paid $3000 for in 1979. I can assure you its throughput was far
below even the slowest drive you can buy today. All drives are not
created equal.

High-end servers use drives that spin faster (less waiting for the
data you want to rotate under the head, and a shorter time to read or
write a chunk of data) and have faster seek times (a shorter time to
move the head to the right track); these are much faster than
consumer-class drives. findU uses six 146 GB drives in a RAID 1+0
array. Data is evenly split between three pairs of drives; that
striping across the pairs is the RAID 0 part. Each bit of incoming
data is written onto both drives of a pair; that mirroring is the
RAID 1 part. Since each drive in a pair has identical data on it,
reads can happen from either drive. So each drive must handle one
third of the writes and only one sixth of the reads. Combine this RAID
performance with the high-end disk performance, and you get a system
that can handle maybe 10 times the throughput of a consumer drive. Not
cheap, and not high capacity (my desktop Mac has 4 times the storage
space of the findU servers), but fast.
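
To make the arithmetic concrete, a quick sketch (the pair count is the
six-drive layout just described; the RPM figures are typical consumer
and server speeds, not a claim about findU's actual drives):

    # Per-drive load in a RAID 1+0 array: three mirrored pairs, data
    # striped across the pairs.
    pairs = 3
    write_share = 1 / pairs        # a write hits one pair, both drives
    read_share = 1 / (pairs * 2)   # a read hits one pair, either drive
    print(f"writes per drive: {write_share:.2%}")   # 33.33%
    print(f"reads per drive:  {read_share:.2%}")    # 16.67%

    # Rotational latency alone shows part of the server-drive edge: on
    # average you wait half a revolution for the sector you want.
    for rpm in (7200, 15000):
        avg_ms = 0.5 * 60_000 / rpm
        print(f"{rpm} RPM: {avg_ms:.1f} ms average rotational latency")
    # 7200 RPM: 4.2 ms, 15000 RPM: 2.0 ms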

And I disagree that there has been no change in speed over 10 years.
At the low end, while the emphasis has indeed been on increasing
capacity, there have also been improvements in speed. Ten years ago
even a desktop did not often have a 7200 RPM drive; now I have one in
my laptop. At the high end there have been large improvements in speed
and smaller ones in capacity.

>
>
> One needs to have multiple disks for: data mirrors, so that a single
> disk can fail without data loss or even service loss, _and_ for IO
> parallelism.

IO parallelism is about speed. Once you have parallelism that travels
the internet, you lose a lot of speed; the fastest ping time is longer
than the slowest seek time. If you use a distributed database that is
not within a single data center, user experience will suffer. I don't
consider alexa.com reliable for traffic rankings because of their
non-random sample, but they have a good metric for response time. I'm
proud they rank findU as very fast; at 0.7 seconds it beats 87% of web
sites. For reference, arrl.net is 3 seconds and qrz.com is 5 seconds.
aprs.fi and aprsworld do not have numbers because they fall below the
rankings at which Alexa performs speed tests. There are many studies
showing that a response time of more than a couple of seconds
adversely colors users' perception of a web site.
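
A rough sketch of the ping-versus-seek point above (every figure below
is a ballpark assumption, not a measurement of any real system):

    # A local database pays disk latency only; a cross-internet
    # distributed lookup pays a network round trip per node consulted.
    disk_seek_ms = 10        # slow consumer-drive seek, worst case
    internet_rtt_ms = 50     # typical cross-internet round trip
    nodes_consulted = 3      # hypothetical fan-out of one query
    local_ms = disk_seek_ms
    distributed_ms = nodes_consulted * internet_rtt_ms + disk_seek_ms
    print(f"local: ~{local_ms} ms, distributed: ~{distributed_ms} ms")
    # local: ~10 ms, distributed: ~160 ms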

When looking at reliability for a distributed system, you need to look
at the reliability of each server to decide how much redundancy you
need. No matter what, you need two copies of each bit. With a
low-reliability system (not just the hardware; in this volunteer
system, Joe goes on vacation and turns off his computer, or there is
an ice storm and he loses power or internet), you probably want at
least three copies. So if you want each server to hold a hundredth of
findU's data, you now need 300 machines. Plus you need a way to
recognize when one becomes unavailable and mirror its data onto
another server. Just another feature to add to the magical central
control of the system.
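
The arithmetic, spelled out (the shard and replica counts are simply
the ones from the paragraph above):

    # Machines needed when the data is split into shards with replicas.
    shards = 100    # each node holds a hundredth of findU's data
    replicas = 3    # copies per shard, given flaky volunteer uptime
    print(f"machines required: {shards * replicas}")   # 300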

I haven't heard: who is going to write this? ;-)

Steve K4HG
