[aprssig] FINDU Weather Intermittency

Dave Anderson KG4YZY dave at aprsfl.net
Fri Feb 8 16:31:55 EST 2008


Hello, Bob....

Allow me to jump in here and give an explanation of what is occurring here.
This will be a long message, and I apologize for that.  For others aside
from you who will reply to this, I will only continue to be a part of the
discussion if the discussion is civilized; otherwise I will just unsubscribe
from the list.   If any of you would like to discuss this further with me in
private e-mail, you are more than welcome to; my address is dave at aprsfl.net.

This requires a slight bit of history, for this to be understood completely.
The CWOP program (Citizen Weather Observer Program) runs on top of the
APRS-IS network. While licensed ham radio operators are a portion of the
CWOP participants, the lion's share of the participants are not hams and
participate only via the internet.   For years, this hasn't been a problem,
aside from some of you who've voiced concerns about the "sea of blue" on
your map with all the CW stations.

Up until last year, the tier2 network had been supporting the CWOP program.
They started having stability issues with their systems, and were crashing
enough that the CWOP users were complaining a lot in the forums.    Many
were told to use the core, knowing that the systems that ran the core were
considerably larger servers, not home level PCs.   About this same time,
(these facts are not disputable, I'll pass on the letter Phil sent to Russ
for any who doubt my message here) Phil with tier2 sent a message to the
head of the CWOP program Russ Chadwick, simply stating that tier2 was
pulling support, that it was better for everyone involved.  This was sent
late on a Friday.  

We (the core sysops) found out about this shortly afterwards and jumped into
action.  Some of us who run core servers added additional memory to servers,
and in the case of myself, I dropped $4500 for a new 8 core box to run one
of the core servers, moving my older dual CPU xeon server to a co-lo I have
access to in Texas, increasing the core server count from 3 to 4.    We knew
that handling 3500 CWOP stations with this much horsepower was just not a
problem.   

Shortly after the CWOP users started connecting to the core, I, being a
network admin of a server farm with over 100 systems in it, noticed several
MASSIVE shortcomings in the software these guys use that I knew even back
then were eventually going to become a problem.

I decided to write a whitepaper with the help of the server developer, other
sysops, and the CWOP management to assist developers in writing code that
would be network friendly, as in the past, no such guidance had been given
to these developers.  I'll outline the major flaws here that all needed
addressed immediately to preserve APRS-IS network reliability and server
availability to the CWOP members:

	*  CWOP software was only doing a lookup on a server name, in most
cases, the first time you loaded the software.  That meant if the IP of the
server changed, or in the case of rotate.aprs.net, where a round robin A
record in DNS is used for pseudo load balancing, the user would be "stuck"
on one server until they reloaded the software.  This problem represented
some 18 of the CWOP apps, including the largest by Davis.

	*  CWOP software was allowing users to set their polling interval
down to 1 minute.... Considering these folks' only purpose was to get
weather information to MADIS via the Findu funnel, anything more frequent
than every 15 minutes, the interval at which Findu passes weather data
upstream to the NWS, was not needed.  Steve had prepped Findu for a 5
minute interval after a bad
windstorm went up the Chesapeake, however, the folks at MADIS apparently
didn't want the data faster than that, citing they did not have the
processing power to handle quality-control of the data more than every 15
minutes.  So with this, a recommendation was made to hard code CWOP only
software to never be able to set the interval faster than every 5, as we had
well over 400 of them sending data at 1 minute intervals.

	* CWOP software, and this is the BIGGEST problem, was using the
local computer's CLOCK for the polling interval.  Take UI-View, if you set a
5 minute interval, its interval is based on the load time of the program.
As hams, we've known for years that if we all beaconed based on an exact
time interval on a clock, that the network could not support this.  Well,
CWOP was never told this, so some of the worst possible network programming
occurred, and these software packages if set for every 5 minutes (the
default of most) sent their data at the top of the hour, 5 past, 10 past,
etc.  Keep in mind at the time there were 3500 stations!!!  (now 4500 of
them)   

	* Finally, knowing the network was growing, it was decided that CWOP
members should use cwop.aprs.net, a hostname we created that at first would
mirror rotate.aprs.net.  Later, when the two needed separated into separate
networks, it would be as simple as changing a DNS record for the non-hams
to be moved to a separate network.  (forward thinking)
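
For any developer reading along, the first three points above can be sketched
in a few lines of Python.  This is only an illustration under my own
assumptions, not code from any actual CWOP package: the function names and the
port argument are hypothetical, and the only numbers taken from this message
are the 5 minute floor and the cwop.aprs.net hostname.

```python
import socket

MIN_INTERVAL_S = 5 * 60  # floor recommended above: never faster than every 5 minutes


def resolve_fresh(hostname, port):
    """Look the hostname up again and return the candidate (ip, port) pairs.

    Calling this before every connect, instead of once when the program
    loads, lets a changed or round robin DNS record take effect.
    """
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return [info[4][:2] for info in infos]


def clamp_interval(requested_s):
    """Refuse any user setting faster than the 5 minute floor."""
    return max(requested_s, MIN_INTERVAL_S)


def send_times(start_s, requested_interval_s, count):
    """Return the next `count` send times, in seconds.

    The schedule is anchored to when the program loaded (start_s), not to
    wall-clock boundaries such as :00, :05, :10.
    """
    interval = clamp_interval(requested_interval_s)
    return [start_s + n * interval for n in range(1, count + 1)]
```

A client built this way would call resolve_fresh("cwop.aprs.net", port) each
time it reconnects, and a program loaded 137 seconds past the hour with a
requested 1 minute interval would send at 437, 737 and 1037 seconds rather
than on the 5's.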

Well these changes were proposed to the CWOP management.  A gent by the name
of Dave Helms was the point man to the 23 some developers of CWOP software.
He immediately balked at the changes, citing that he had just gone thru
having all of them move to rotate.aprs.net, that this was too much to insist
they change, and that developers would pull support from CWOP if they knew
this many changes were needed.  He further went on to say these "changes" needed
"tested" first (eh?).  Since most of these changes are simply good software
coding, most of which every ham piece of software already did, I couldn't
see the problem.  He finally after much discussion said he'd never pass this
on to the developers, so I gave up trying to complete this document after
the third draft of it was done. 

Jump to December of 07.  By now we're at 4000 CWOP stations, all following
poorly written code, and all of a sudden Christmas hits.  An additional
almost 500 new stations in the period of a week joined CWOP from gifts.  

Any of you running an APRS-IS server, if you watch the CPU load of the task
running the server, you'll see on the "5's" your CPU load spikes, and so
does the BPS and PPS of your server.  That's the CWOP users dropping off
their packets.   Some simple math was done: CWOP stations represent a small
1/8 of the APRS objects, yet they account for a full 1/4 of the bandwidth.  

All of a sudden the core servers started running into some odd issues.  We
had servers with input queues stacking up into the hundreds of seconds, and
servers that simply would not even stay running without crashing or locking
up the system at 100%.  Now, it's January, and Pete Loveall, who also has a
day job, is releasing private builds of server code left and right to help
the core network out of this problem.    He finally arrived at the build
that was released a few days ago, with as many optimizations as could be
thought of.  He managed to reasonably keep servers from crashing, and keep
the queues under 30 seconds, but we're far from having the problem under
control, as testing shows we have loss of packets still occurring.

What's worse is that this is not just affecting the core.  Let me explain
why.  Any server that takes a packet, say APRSfl.net (any tier2 server,
etc), with 150 users (all filtered) gets a CWOP packet.  It takes that -one-
packet and has to multiply it by 150, and then run it thru several hash
tables, and finally the write thread with the filter command will either
send or not send that packet to that user.  So even though APRSfl takes only
30 or so CWOP stations (based on log history), it still goes from running
2-3% cpu load on a dual cpu xeon box to 20-25% on the 5's of the clock.  All
of this, again, is due to CWOP software sending its reports in using the
computer's clock for the polling time.
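
The multiplication described above can be pictured with a small Python
sketch.  It is a simplification under my own assumptions, not the actual
server code: the hash table stages are left out, the packet layout and the
filter functions are hypothetical, and only the figure of 150 filtered users
comes from this message.

```python
def fan_out(packet, client_filters):
    """Offer one inbound packet to every connected client.

    Returns (checks, deliveries): how many filters had to be evaluated
    and how many clients actually wanted the packet.
    """
    checks = 0
    deliveries = 0
    for wants in client_filters:
        checks += 1            # the work is done whether or not it is sent
        if wants(packet):
            deliveries += 1
    return checks, deliveries


def wants_position(packet):
    """Hypothetical filter: client only asked for position reports."""
    return packet["type"] == "position"


def wants_weather(packet):
    """Hypothetical filter: client asked for weather reports."""
    return packet["type"] == "weather"


# One CWOP weather packet arriving at a server with 150 filtered users,
# of which (hypothetically) only 3 asked for weather.
packet = {"source": "CW0001", "type": "weather"}
clients = [wants_position] * 147 + [wants_weather] * 3
checks, deliveries = fan_out(packet, clients)
print(checks, deliveries)
```

The point of the sketch is that the number of filter checks tracks the number
of connected users, not the number of users who end up receiving the packet.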

This onslaught of traffic is a massive spike to the APRS-IS network every 5
minutes.  I don't care who you are, if you take a full feed, you'll feel the
pinch of this.  During the "drive by 5's", as I've affectionately started
calling them, the APRS-IS servers start dropping packets.  I'm not 100% sure
why, aside from the fact that every core server, even fourth with its 8 core
system, buries the CPU for 30-45 seconds at each of those 5 minute intervals.
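
A rough simulation shows why the spike lands on the 5's.  This is a sketch
under stated assumptions, not a measurement: it assumes every station's clock
is perfectly in sync (real clocks drift, which would smear the burst over
more seconds) and it draws hypothetical program load times at random.  The
4500 station count and the 5 minute interval are the figures quoted in this
message.

```python
import random
from collections import Counter

STATIONS = 4500      # station count quoted earlier in this message
INTERVAL_S = 300     # 5 minute reporting interval


def clock_aligned_send(start_s):
    """Next send for software that fires on wall-clock multiples of 5 minutes."""
    return (start_s // INTERVAL_S + 1) * INTERVAL_S


def start_relative_send(start_s):
    """Next send for software that counts 5 minutes from its own load time."""
    return start_s + INTERVAL_S


def peak_per_second(send_times):
    """Largest number of stations sending in the same second of the 5 minute cycle."""
    offsets = Counter(t % INTERVAL_S for t in send_times)
    return max(offsets.values())


rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.randrange(86400) for _ in range(STATIONS)]  # hypothetical load times

aligned_peak = peak_per_second([clock_aligned_send(s) for s in starts])
relative_peak = peak_per_second([start_relative_send(s) for s in starts])
print(aligned_peak, relative_peak)
```

With clock-aligned sending, all 4500 stations land in the same second of the
cycle.  With load-time-based sending, the same stations spread across the
whole 300 second cycle, which averages out to about 15 per second.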

Come to last week.  This had not been posted here as we (Greg, Gerry, Pete
and myself) were working our tails off trying to come up with a solution for
this.  After Dave once again refused to help out here, I took a rather
draconian step to prove a point, and had Steve remove third and fourth (my
two servers) out of rotate.  Well it definitely brought the problem to
light, and in the weather quality forum, many developers contacted me
directly, and most wanted copies of this draft whitepaper once they found
out it existed.  I explained why it hadn't been completed, and Dave Helms
decided to publicly accuse me of being the reason why it was not
completed.   Dave was concerned about developers pulling support, when he
should have been worried about the network doing the same.   Well, I went
off the deep end, and pulled my support from CWOP completely at that point.
It apparently wasn't enough that the NWS had a pool of volunteer sysops
running this network who dump thousands of dollars worth of bandwidth and
hardware and never get reimbursed a penny for doing so, but now, it was my
fault that this whitepaper wasn't done?  I guess he forgot that it was he
that refused to publish it.  

What was worse is that he publicly said that if non-hams had to lengthen
their polling interval then hams had to do the same.  Excuse me?  CWOP users
are guests on our network.  APRS-IS is built by, run by, and exists to
support hams.  Non-hams using it are nothing more than guests that use us
as a way to get their weather reports to the NWS.   That offended many hams;
I was at over 200 private e-mails yesterday from hams pulling CWOP support
over that callous attitude that Dave had.

Gerry of first said it's time to get the CWOP users off of the APRS-IS, and
that he would get a non-ham set of servers up and running post haste.
That's a great solution, and definitely gets their traffic off of our
network, but a full 1/5 of the CWOP users still use tier2 even though the CWOP
management has been contacting users for a year to move over, so even if a
separate set of servers exists, it'll take forever to get them off our
network given the CWOP management have said they refuse to contact users to
switch to a separate set of servers.

So while this was going on, and with me now being blamed for this problem,
Dick from Tier2 decided to go start the tier2 vs core debate in the qc forum
again, telling folks there is no need to change their polling intervals,
that nothing is wrong with the software, and to come back to the tier2
servers where all will be well.  

Dick has zero network background and had no idea just how detrimental CWOP
has become to the APRS-IS.  He doesn't realize that a full 3 out of 10
packets that flow thru the core right -now- are not making it end to end on
these 5 minute polling intervals.  It will not make ANY difference where
these stations -enter- the network.  The load comes from the fact that, in
the case of the core, a load of 50 packets per second worth of CWOP stations
turns into 10,000 packets per second by the time it's been processed for all
the stations connecting to a core server.   It doesn't, again, matter
-where- the data comes from; since all data flows THRU the core, it's a
problem.   CWOP
simply needs to be separated off our network, since they clearly will not
write software that will not detrimentally affect our network, and the CWOP
management views hams as more of the problem here.
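
As a quick check on the two figures quoted above, the arithmetic is below in
Python.  The function name is hypothetical, and the reading of the result as
"connections each packet is offered to" is an assumption on top of the
numbers, not something measured.

```python
def implied_fan_out(inbound_pps, processed_pps):
    """Ratio of processed packet rate to inbound packet rate."""
    return processed_pps / inbound_pps


# 50 packets per second in, 10,000 packets per second of processing
factor = implied_fan_out(50, 10_000)
print(factor)
```

If every inbound packet is offered once to every connection, that ratio of
200 is what a core server with about 200 connected stations would produce, no
matter which server the packets first entered through.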

So with the whole tier2 vs core issue now having come back to light, and the
fact that I alone had to stick my neck out to bring a problem to light, I've
reached a point where I have decided to exit the APRS-IS core network.  

Clearly, in the eyes of some, 25+ years of networking experience isn't worth
anything.  The fact I have $9000 worth of servers and about 10MB/sec of
bandwidth 24HR a day that I've donated to the cause wasn't good enough
either.  Unlike some who do this with spare gear lying around and use
academic bandwidth at zero cost out of their pocket, I run a commercial
business and this -had- a fixed cost to me.  I had no problem doing this!!!  I was
glad to, but when I started being blamed for things I had no control over,
that was it.

So after years of abuse, no appreciation, and finally being accused of
creating a problem I was trying to fix,  I have informed the core server
sysops that I will be shutting off third and fourth come 03/01/08 midnight
EST.

I take with me, when I exit, a full 1/2 of the APRS-IS core server capacity.
I know this will be painful to all APRS-IS users.  I only hope there are
some others who are willing to step up and run a core server (even though I
have to warn them it is a zero appreciated job that will take years off your
life from mental abuse).  

I advise the APRS community NOW that future core sysops should not have ANY
affiliation with the CWOP program, or the drastic measure that will
ultimately need taken, filtering and blocking CWOP 100% out of our network,
will never happen.  I personally think this will be the only solution as well,
given current events.    If we want stability to come back to our network,
this is right now the only option.

I'm sorry I have to leave, and I'm sorry to the other core sysops and Pete
for having taken this public, but this dirty laundry needs aired.  The
entire APRS community needs to be aware of how much damage CWOP is causing
to our network.

As I mentioned earlier, I will not participate in this dialog further if
this turns into a flame war, but I will offer my advice, help and assistance
otherwise.  Again, you can contact me privately at dave at aprsfl.net for more
information or a copy of the draft CWOP white paper.


Regards,
Dave Anderson
KG4YZY

> Has anyone noticed a problem with FINDU weather Intermittency lately? My
> N3OZB-2 weather reports are going out at regular 10 minute intervals to
> rotate.aprs.net, but intermittently not showing up on FINDU for periods of
> an hour or two to a day. Curiously, a couple of neighboring stations, 9654
> & 9692, seem to be having similar interruptions, while nearby 454 doesn't.
>
> Could this be a setup problem, internet problem, or IS problem?


