[aprssig] distributed findu possible ?

Sun Aug 10 09:02:10 EDT 2008

On Sat, Aug 09, 2008 at 06:17:14PM -0400, Steve Dimse wrote:
> From: Steve Dimse <steve at dimse.com>
...
> See my last reply for some findU stats. Keep in mind what you are  
> talking about here is cherry picking the easy stuff findU does.

Steve,

One of the reasons people have no idea what findU can do is its
"user interface".  Indeed, you have supplied only the backend of things and
no frontend at all, and for some details, like how long data is retained,
the information is not given anywhere that I can spot.

It is much, much easier to point aprs.fi's map at the general area of
interest and then look at what happens around there.

For that matter, things like CWOP really are rather invisible, and
doing data accumulation and relay to NOAA is not something one would
want to do with distributed systems.  (Not that I believe too strongly
in the extreme distribution idea at all...)

Knowing what a rich swamp of encoding formats APRS packets are, incoming
packets must be pre-parsed for position, symbol, etc. before all that data
is fed into a database, and then there must be some _smart_ ways to
index those parse results so that one can find "all APRS positions within
a 20 mile radius of position X,Y", or "all APRS entities with symbol S", or
whatever else there may be.  Plus time ranges.  Plus application specifics,
like WX and telemetry.
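
As a rough illustration of such indexing, here is a minimal Python sketch,
assuming packets have already been parsed into callsign, latitude, longitude,
symbol and timestamp.  The PositionIndex class, the 1-degree grid cells and
the sample values are all hypothetical; this is not how findU or aprs.fi
index their data.  The sketch ignores wrap-around at the +/-180 degree
meridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius in statute miles
MILES_PER_DEG_LAT = 69.0     # slight underestimate, so the search box errs large


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in statute miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


class PositionIndex:
    """Hypothetical in-memory index: 1-degree grid cells plus a symbol index."""

    def __init__(self):
        self.by_cell = defaultdict(list)
        self.by_symbol = defaultdict(list)

    def add(self, callsign, lat, lon, symbol, ts):
        rec = (callsign, lat, lon, symbol, ts)
        self.by_cell[(math.floor(lat), math.floor(lon))].append(rec)
        self.by_symbol[symbol].append(rec)

    def within(self, lat, lon, radius_miles, t_from=None, t_to=None):
        """Records within radius_miles of (lat, lon), optionally in a time range."""
        # Over-approximate the search area with a lat/lon box, then filter exactly.
        dlat = radius_miles / MILES_PER_DEG_LAT
        worst_lat = min(abs(lat) + dlat, 89.0)
        dlon = dlat / math.cos(math.radians(worst_lat))
        hits = []
        for cell_lat in range(math.floor(lat - dlat), math.floor(lat + dlat) + 1):
            for cell_lon in range(math.floor(lon - dlon), math.floor(lon + dlon) + 1):
                for rec in self.by_cell.get((cell_lat, cell_lon), ()):
                    _, rlat, rlon, _, ts = rec
                    if t_from is not None and ts < t_from:
                        continue
                    if t_to is not None and ts > t_to:
                        continue
                    if haversine_miles(lat, lon, rlat, rlon) <= radius_miles:
                        hits.append(rec)
        return hits
```

Only the grid cells overlapping the search box are scanned, so the cost of a
radius query depends on local station density rather than on the total
number of stored positions.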

I have had a small peek at what DWH is, and how things are handled there.
Raw data goes in, gets transformed in a number of ways, and is viewable
via "product tables".

In the end the raw data may not live in the system for very long, but
those end-product views are longer-term data.
Like:

  http://aprs.fi/weather/OH2KXH/year
  http://aprs.fi/telemetry/OH2RDK-5/month

I don't know how long the data is truly kept in the aprs.fi system, but raw
data is purged a lot sooner than analysis products.
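
The raw-versus-product split can be sketched in Python as below.  This is
only an illustration of the general idea, assuming raw samples are
(unix_time, value) pairs and that the "product" is an hourly min/max/mean
summary; the function names are hypothetical and nothing here is taken from
how aprs.fi actually works.

```python
from collections import defaultdict


def hourly_rollup(raw_samples):
    """Summarise (unix_time, value) pairs as {hour_start: (min, max, mean)}."""
    buckets = defaultdict(list)
    for ts, value in raw_samples:
        buckets[ts - ts % 3600].append(value)
    return {
        hour: (min(vals), max(vals), sum(vals) / len(vals))
        for hour, vals in sorted(buckets.items())
    }


def purge_older_than(raw_samples, cutoff):
    """Keep only raw samples at or after cutoff; summaries are stored separately."""
    return [(ts, value) for ts, value in raw_samples if ts >= cutoff]
```

Once the summary rows exist, the raw samples behind them can be dropped
without losing the long-term graphs.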

> This is what aprs.fi does, and to some extent aprsworld, but you
> can't call it a findU replacement unless it does the hard stuff.
> What are your plans for handling:
> 
>   http://www.findu.com/cgi-bin/wxpage.cgi?call=K4HG&date=20051023&last=30
> 
> This is a three year old plot of the data at my house just before  
> Hurricane Wilma hit and either the DSL line went down or the UPS gave  
> out after power failure. You can get this for any weather station for  
> any time in the last eight years that it sent data to the APRS IS. Or
> 
> http://www.findu.com/cgi-bin/track.cgi?call=w7lus-14&geo=usa.geo&start=99999
> 
> How are you going to show month+ long tracks?

That all means that:
  - Data is kept in a persistent database (no RAM-only nodes)
  - Its insertion must be cheap (that is, quick)
  - Its retrieval must be cheap (which may make the insertion less cheap...)
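
The trade-off in the list above can be illustrated with a small Python
sketch using the standard sqlite3 module.  The table layout, the index name
and the helper functions are assumptions made for this example only; a real
store would use a file path rather than ":memory:" so that the data
persists.

```python
import sqlite3


def open_store(path):
    """Open (or create) a hypothetical position store at the given path."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS positions ("
        "callsign TEXT NOT NULL, ts INTEGER NOT NULL, "
        "lat REAL NOT NULL, lon REAL NOT NULL)"
    )
    # The index makes track lookups cheap, at the price of extra work per insert.
    db.execute(
        "CREATE INDEX IF NOT EXISTS idx_call_ts ON positions (callsign, ts)"
    )
    return db


def insert_positions(db, rows):
    """Insert many (callsign, ts, lat, lon) rows in a single transaction."""
    with db:
        db.executemany("INSERT INTO positions VALUES (?, ?, ?, ?)", rows)


def track(db, callsign, t_from, t_to):
    """Return (ts, lat, lon) for one callsign in a time range, oldest first."""
    cur = db.execute(
        "SELECT ts, lat, lon FROM positions "
        "WHERE callsign = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (callsign, t_from, t_to),
    )
    return cur.fetchall()
```

Batching inserts into one transaction keeps insertion quick, while each
extra index added to speed up retrieval makes every insert a little more
expensive.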

Disk capacity keeps growing, but a disk can still handle only so many IO
operations per second, because moving the heads across the disk surface and
spinning the platters take roughly the same time now as they
did 10 years ago.  Thus a single terabyte disk is no _faster_ at doing IOs
than a single 10 GB disk.
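
A back-of-the-envelope calculation makes the point.  The Python sketch below
uses the usual rough model, one random IO costs an average seek plus half a
rotation; the 8.5 ms seek time and 7200 rpm figures are assumed example
values, not measurements of any particular drive.

```python
def random_iops(avg_seek_ms, rpm):
    """Rough random-IO rate of one spinning disk: 1 / (seek + half a rotation)."""
    rotational_latency_ms = 0.5 * 60_000 / rpm
    return 1000 / (avg_seek_ms + rotational_latency_ms)


def array_iops(disks, avg_seek_ms, rpm):
    """Idealised upper bound when independent IOs spread evenly over the disks."""
    return disks * random_iops(avg_seek_ms, rpm)


# Assumed example figures: 8.5 ms average seek, 7200 rpm.
single = random_iops(8.5, 7200)    # about 79 IOs per second
four = array_iops(4, 8.5, 7200)    # about 316 IOs per second
```

Capacity does not appear anywhere in the formula, which is why a bigger
disk is not a faster one; only more spindles raise the number.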

One needs multiple disks both for data mirrors, so that a single disk can
fail without data loss or even service loss, _and_ for IO parallelism.

> With your distributed system, how do you handle a guy that travels  
> from the area covered by one server to another? There are lots of  
> details you need to address...

The same data must exist at multiple systems, both because of data
replication and because of indexing, to answer "what stations were near
OH2MQK's position on date NN" - the lookups could go to the "OH databases"
and the "Eastern Canada databases".

Pretty soon things degenerate to "have all data at all nodes", which just
amounts to load balancing across parallel servers.  However, if not all nodes
get all APRS packets, there can be awkward holes in the views of the world.

To ensure that all packets make it to all nodes, the way is to connect
each data collector to all APRS-IS core nodes to pull in all data...
which is a rather stupid thing to do because of the core load it causes
when done in a large-scale setup.

The alternative would be to query all partitioned database nodes for relevant
data, and then do a merge-unique before presenting the result, but I do
recall that the goal was to _reduce_ the amount of network traffic in the
system, and for a globally distributed system things do get a bit sticky when
backends have to do global lookups.
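
For what the merge-unique step itself could look like, here is a minimal
Python sketch.  It assumes each node returns its matches as a list of
(timestamp, callsign, packet) tuples already sorted in ascending order, and
that the same packet carries an identical tuple on every node; both are
simplifying assumptions, and the merge_unique name is hypothetical.

```python
import heapq


def merge_unique(*node_results):
    """Merge sorted per-node result lists, dropping records seen on several nodes."""
    last = None
    for rec in heapq.merge(*node_results):
        # Inputs are sorted on the whole tuple, so duplicates arrive adjacent.
        if rec != last:
            yield rec
            last = rec
```

The merge is cheap; the expensive part is that every query still has to
reach every node that might hold relevant data and ship the results back.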

> Steve K4HG

73 de Matti, OH2MQK

PS: Steve, do check DNS A records of  findu.com  and  www.findu.com