[nos-bbs] My response - JNOS not being a race car, having stress problems.

(Skip) K8RRA k8rra at ameritech.net
Thu May 31 15:15:00 EDT 2007


Now THAT was a LOT of heat for what I wrote yesterday.
I'd like to fill in a few "blanks".
AND 
I'd like to put emphasis back on jnos *only* while jnos might appear
to be "sickly".  Healthy jnos is the norm, yet I had to deal with a bad
outcome for parameters I set (or mis-set) in the validation suite I'm
creating...

For any new or casual users:  you have a better chance at the lottery
than of seeing what I saw.

On Thu, 2007-05-31 at 01:33 -0400, Jay Nugent wrote:
> Greetings,
> 
> On Wed, 30 May 2007, Maiko Langelaar (ve4klm) wrote:
> 
> > Hi all,
> > 
> > I personally do not want everyone on this list to be distracted or scared
> > away from using JNOS to it's full potential, just because of a few recent
> > observations and comments, which if misread, will leave new and potential
> > NOS users with a not so great impression of what it is truly capable of.
> 
>    Agreed!
JNOS certainly has immense capability; I remain baffled over the
relative success of lesser software in the presence of jnos.  If there
is a lesson for me, it is to establish a reasonable expectation of jnos
performance and the tools to deal with any shortfall experienced
from time to time.  And this is not a comparison to other software
options; jnos certainly can stand on its own.

Still at the risk of being misread again, it seems to this writer that
jnos has a potential to respond in an erratic way when jnos is abused by
mis-setting perhaps only one tuning parameter.

Jnos certainly has a history at my site, and also at my neighbor's.  Jay,
you have been party to it over the past months; one good example is the
convers bridge net you held a month(+) ago, when this station could not
be heard by others and heard only a fraction of net activity.
>  
> > > domain.txt has the characteristic of (self?) destruction from unknown
> > > causes at this time. It is likely a program instability ...
> > 
> > Why is it "likely" a program instability ? This stuff has been running
> > for YEARS, and hardly changed over that time. It's more or less proven
> > to work way before I even got involved in it.
> 
>    I have seen domain.txt corrupted but ONCE since I first started running 
> it back in 1992.  ONE hosed record in 15 years is certainly not 
> "unstable".
The words "program instability" are inflammatory to you, and are
potentially overstated without having background.  Let's consider
another way of describing the effect.

Both my station and my neighbor's have a long history (year+) of
continuous up-time.  If you study the monitor quoted later, and link
re-boots to reasons, VERY FEW have jnos stability as a cause.  Likewise,
with poor RF link performance we transacted little traffic.  In that time
I asked a lot of questions that led to no solution.

With the push to identify, isolate, and correct the perceived problem I
forced up traffic volume arbitrarily and the effect seemed to be:
 a) further depressed data flow over the link, near standstill.  The
harder I pushed the slower it got.
 b), c), d), ... [see the list below so as to not duplicate]

Interestingly enough, when I tried to "patch" domain.txt with vi, jnos
did NOT do well with the result.  Upon replacing domain.txt with the
standard mi-drg version all went well.  Crash may be related to a bad
domain.txt.  That is pretty direct, but bad data causing crashes is
never a good thing?  
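
As a rough sketch of the kind of sanity check that could be run on
domain.txt before jnos reads it: this is NOT part of jnos.  The record
layout (name, optional TTL and class, type, data), the comment
characters, and the helper name bad_a_records are all my assumptions,
and the damaged third entry is made up for illustration.

```python
import ipaddress

def bad_a_records(lines):
    """Return 1-based numbers of lines whose A record address is not valid IPv4."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        # Assumption: blank lines and lines starting with '#' or ';' are not records.
        if not text or text[0] in "#;":
            continue
        fields = text.split()
        # Assumed layout: name [ttl] [class] type data; only A records are checked.
        if len(fields) >= 3 and fields[-2].upper() == "A":
            try:
                ipaddress.IPv4Address(fields[-1])
            except ValueError:
                bad.append(number)
    return bad

# Host names and addresses are taken from the monitor quoted in this thread;
# the last entry is a hypothetical corrupted record.
SAMPLE = [
    "hamgate.wayne49.ampr.org.\tIN\tA\t44.102.49.1",
    "eocway.ampr.org.\tIN\tA\t44.102.48.88",
    "bogus.ampr.org.\tIN\tA\t44.102.999.1",
]

print(bad_a_records(SAMPLE))
```

With the sample above only the third line is flagged.  A check like this
says nothing about why an entry went bad; it only catches one kind of
damage before jnos has to deal with it.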

Happily, crashes have again ceased (so far), which points (me) to a
relationship between the crashes and a bad domain.txt.  A good domain.txt
stopped crashes, and while "domain update off" remains in force there is
a lock on the door.  This does little to isolate the root cause, but
does point to conditions that allow success.

Given the coincidence of no crashes while not pushing traffic volume,
it seems fair to draw the conclusion that the flaw in domain.txt is a
consequence of my testing program.  I expect detailed questions over this
- worthy of a search for a root cause.

The unexplainable conditions at my site persisted until the data
transfer went smoothly after parameters were corrected.  Likewise the IP
error persisted until after parameters were reset and domain.txt was
repaired.  It may be important to note that DNS services for my site
pass thru my neighbor site, but do not depend on his domain server.

These actions were taken in concert with parameter adjustment and not
alone, thus you might conclude it has a separate cause.  However, I point
to the success of test-case #1 and subsequent failures of cases #2, #3,
and so forth that lead (me) to the potential of a cause/result pair.

Why all the description?  To allow me to ask for a new set of words (not
program instability) that fairly describe a situation where unrelated
observable results are tied to a known change.  Cause = bad parms +
high traffic load / result =? bad domain.txt followed by crashes
concurrently with bad IP and short lists.

This is not a search for a root cause; a well formed testing program is
needed for that.  This is not an answer to setting parameters; that is
done elsewhere.  This is not about slowing down under load.  This is not
about achieving long term up-time.  This IS a statement that jnos SEEMS
TO act in irrational ways when it is "abused" with poorly set
parameters.

In addition, I am hard pressed to find that the hardware caused a normal
problem, or that the OS caused a predictable problem, or that RF fried
memory, or that the hard drive read differently than it wrote, and so
forth...  

Last, I have many years behind me where in any one event a few causes
manifest themselves in both predictable and unpredictable ways.  I'm not
deterred from listing the causes and results in the same thought even
though the chain of events is not fully defined yet.  Old habits?
> 
>  
> > > This is a suggestion to NOT try to get the "last ounce" of performance
> > > from jnos installation because jnos is not a race car.
> > 
> > I don't think that's fair, and anyone reading this (new JNOS users in
> > particular) are not going to get the best impression from reading this.
> > 
> > You may be surprised to learn that JNOS is indeed a race car, BUT like
> > all vehicles, if you load it down too much, it slows down. I'm sure if
> > one has a kick ass machine, that will definitely help things out.
> 
>    Ran my first JNOS box on a 386 40MHz with only ONE meg of RAM.  That
> was Hamgate.Merit.EDU and the machine had 4 radio ports plus a very busy
> ethernet port.  It pushed MOST of the Internet mailing lists such as
> NOS-BBS, various lists from TAPR, Kep tables, and the usual load of
> bulletins and personnal email for Hams living in Eastern Michigan from
> Port Huron down to Toledo, Ohio.  Daily logs showed well in excess of 500
> messages per day, some SMTP and some heirarchical mail being passed to
> conventional BBS's.  I'd say it was chugging along quite well, race car or
> not!
It seems that "race car" is a bad analogy.  The issue here is not about
slowing down; the underlying issue is about capturing a little
bandwidth alongside other unrelated users who have (intentionally?)
saturated the channel and left only very small opportunities (time
slices) to squeeze in.  For me the situation offered an opportunity to
excel rather than a reason to complain.  We hams are supposed to have
equal access to bandwidth?

I can confirm that data flows nicely over channels that are populated
with jnos hosts which are reasonably tuned.  The flip side of that coin
is that maximizing data flow involves interrelationships between
protocols and TNCs that make network experience in the trenches quite
valuable.

"Pretty good" performance is pretty easy in a group of pretty
like-minded individuals.  Default values do a lot for this, and mi-drg
experience supports a fine set to run with.  Perhaps my race car example
where win/lose can be separated by a millisecond, is better stated as
cut-and-run to a reasonable highway with rules rather than to stay on a
race track...?
> 
> 
> > I've run JNOS (in DOS) as a router that had uptimes of over a month, I
> > know of people that have had uptimes of over half a year. No application
> > servers running of course, just packet switching more or less.
> 
> hamgate.wayne49.ampr.org
> KA8E-4    Dearborn    (44.102.49.1)  :  29.4 ms   194256  236:04:34:13
> eocway.ampr.org
> WA8EOC    EOCWAY      (44.102.48.88) :  6910.2 ms 227104  201:10:48:44
> bbs.n8kuf.ampr.org
> N8KUF-4   Monroe      (44.102.56.33) :  5735.7 ms 66784   104:19:55:03
> 
>    So hows that for uptimes!  236 days, 201 days, and 104 days.  The 
> latter two are located at remote sites and managed over RF.
Jay, I read the same report and see the same results.  Don't you think
success is based on well done installations?  Of the 18 hosts in the list
today, the mean up-time is 38 days with a std.dev. of 68, and we have the
bottom (worst up-time) well covered.  In the same breath, it seems our
hunt is over, yet the pain of getting there remains on our minds?  If
anything, mi-drg might devote a PROMINENT spot on the website to publish
the quoted success stories complete with pictures, schematics, equipment
lists, AND jnos parameters?  Good for business...
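
For anyone who wants to run the same kind of numbers, here is a minimal
sketch of the arithmetic.  It uses only the three up-time fields quoted
above, not the full list of 18, so it will not reproduce my 38/68
figures.  Reading the field as ddd:hh:mm:ss is my assumption (the days
part matches Jay's reading), and uptime_days is a made-up helper name.

```python
from statistics import mean, stdev

# Up-time fields copied from the monitor lines quoted above.
UPTIMES = {
    "KA8E-4": "236:04:34:13",
    "WA8EOC": "201:10:48:44",
    "N8KUF-4": "104:19:55:03",
}

def uptime_days(field):
    """Convert an up-time field, assumed to be ddd:hh:mm:ss, to fractional days."""
    days, hours, minutes, seconds = (int(part) for part in field.split(":"))
    return days + (hours + (minutes + seconds / 60) / 60) / 24

days = [uptime_days(field) for field in UPTIMES.values()]

# stdev() is the sample standard deviation; use pstdev() for the population form.
print(f"hosts={len(days)}  mean={mean(days):.1f} days  std.dev={stdev(days):.1f} days")
```

With a std.dev. larger than the mean, as in the full list, a few very
long up-times dominate, so the median is probably worth printing as well.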
> 
>  
> > > It is my opinion that jnos has an internal processing logic problem that
> > > manifests itself only under "stress".
> 
>    Well if the old Hamgate.Merit.EDU box wasn't under "stress", I don't
> know what is.  My suspicion is that disk errors are more likely due to RF
> getting into the box, or a weak power supply.  But JNOS itself is and has
> been incredibly stable :-)
This event that made all these words might make a case that I simply
don't know what I am doing.  Make it if you are interested.  On the
other hand, if you want to attract new users perhaps making application
issues disappear deserves a prominent piece in that plan.  I have
started a wiki that deals with application rather than development; it
could be a piece in the puzzle.  Further, I intend to turn it
over to Maiko, Jay, or perhaps another key player in ampr.org when it
has reached a little maturity and audience.

Maiko makes many other good points that I choose to not quote here, and
again his focus is on a well founded site in probably a well coordinated
network.

***
BUT mine (ours?) has not been a well-founded site for some time.  It's
better now.
***

Perhaps one embarrassing point is that I cannot explain in detail what
choices were made over the past year that got us to the point where we
found ourselves a couple of weeks ago.  You need to appreciate that I am
(we are?) living proof that jnos responded in irrational ways to what I
have called "stress" before.

IRRATIONAL?  In response to more traffic:
 1) FTP began nicely, but developed a nasty habit of instantiating ftpd
with a bogus IP number.
 2) The "tcp view" command limited itself to 3 entries when perhaps as
many as 7 sockets were worthy of reporting.
 3) The other end of the RF link began crashing.
 4) domain.txt at the other end contained bogus entries that tied to the
bogus FTPD IP and the mis-set parameter.
 5) Several parameters at both sites required change, but the worst
errors fell to "TCP BLIMIT = 2" and "IP RTIMER = 4" that are totally
unrelated to FTPD IP, tcp view, and domain.txt on a linked site.
 6) Until parameter changes in this effort required segmentation, the
problem was only annoying; however, with changes to parameters from
original values toward today's target values, segmentation began and the
problem became quite intolerable.

Now the parms are changed, domain.txt is changed, and jnos handles more
traffic like a champ with all the above irrational stuff GONE!

But I'm repeating myself?  Sorry.  If jnos were not very fine software I
would not be using it.  If jnos had one concise and bullet-proof user
document I would not be doing this.  *IF* I am correct, then I should be
able to recreate this behavior in a controlled environment and produce
diagnostic proof that CAUSE A produces RESULT B that is found in
unrelated aspects of the software.  Further that RESULT B stands for BAD,
deserving of a software change to trap it with an error message or
otherwise eliminate the behavior.

Because RESULT B can be avoided with proper parameter setting, I favor
letting go of this issue for the time being.  WIKI treatment is in the
works for reasonable parameter settings - won't that be a good enough
work around?

So I hope I'm (we're) done with this topic and we can happily get on...

73
de [George (Skip) VerDuin] K8RRA k




