[nos-bbs] crashes (netrom) have mysteriously stopped

Fri Dec 9 13:36:34 EST 2022

It sounds like you were having the same problem as I was.
Only in my case jnos might run for 15 minutes. I too couldn’t find the
problem. Then I happened to notice that the three links I had to Quebec
were mysteriously causing it to crash. For whatever reason the owner of
them was having everything run through his bpq and as a consequence when I
tried to get to his jnos I couldn’t because it was hiding behind bpq and I
told him about it. His jnos was trying to communicate directly to me. I
don’t think that was the only case but it certainly was a contributing
factor. So I dropped all three links and have had minimal issues since. I
told him that once he fixed his problems I’d allow the links again. I have yet
to hear from him. I noticed a couple of weeks ago that his bpq is using the
forwarding to me, but so far it’s okay.
73, Don

On Fri, Dec 9, 2022 at 11:11 AM <maiko at pcsinternet.ca> wrote:

> Good day,
>
> Shaking my head I am, crashes have mysteriously disappeared, and
> so far my uptime has been 11 days and counting. The crash dumps
> tell me where the crash occurs, but it makes no sense.
>
> Perhaps a link to another system that came and went over the
> past few months ? perhaps malformed netrom packets, or netrom
> code not dealing properly with a netrom feature that is rarely
> used (IP over NETROM for example), could be stack related, this
> one is a doozy, welcome to the world of software development.
>
> So now when you want a crash, you don't get one ...
>
> Maiko / VE4KLM
>
> On 2022-11-19 11:23, Maiko (Personal) wrote:
> > Nothing like jinxing yourself, forget it, it gets fixed when
> > it gets fixed. No idea now as to why, could be newer compiler
> > for all I know (I did upgrade my OS a while ago), so perhaps
> > it's exposing something in the JNOS code, sigh, not the first
> > time it's happened.
> >
> > Last post on this, sorry for filling up your mailboxes :]
> >
> > M
> >
> > On 2022-11-18 1:42 p.m., maiko at pcsinternet.ca wrote:
> >> Interestingly enough the crashing 'seems' to have stopped.
> >>
> >> All of this started a while ago after I added a new wormhole
> >> to another system (FBB over BPQ netrom), but that system seems
> >> to have suddenly disappeared from the ether. I think they are
> >> having amprnet connectivity issues, so this will be much more
> >> difficult to track down now as I don't have a source to figure
> >> this out with. I am trying to track down the version of BPQ,
> >> hoping it will help me figure out what to do on my end.
> >>
> >> There is something about the netrom traffic or states that is
> >> causing JNOS to crash in the NR4 level code, but I have yet to
> >> figure it out ... it's very confusing what is going on ...
> >>
> >> Maiko / VE4KLM
> >>
> >> On 2022-11-13 11:12, Maiko (Personal) wrote:
> >>> Okay, last one for now, and learning as I go ...
> >>>
> >>> Perhaps I need to set the NR4TDISC a lot lower (default) ?
> >>>
> >>>    jnos> netrom tdisc
> >>>    NR4 redundancy timer (sec): 120
> >>>
> >>> Experiences anyone ? But still, even with a smaller timeout value,
> >>> there is a 'risk' of a crash, making me think the current way of
> >>> doing a circuit table lookup and reusing entries, seems not to be
> >>> the brightest way of doing it ? thinking a 'rewrite', ugh, no.
> >>>
> >>> Maiko / VE4KLM
> >>>
> >>> On 2022-11-13 10:47 a.m., Maiko (Personal) wrote:
> >>>> I am guessing (hopefully this shows up in my debugs) ...
> >>>>
> >>>> IF the local side requests a netrom layer 4 disconnect, then JNOS
> >>>> should probably free the callback there and then, instead of waiting
> >>>> for the final disconnect (which may not get to us). I figure it would
> >>>> not hurt to remove it at that point, since effectively it is done.
> >>>>
> >>>> I could put in a timer based garbage collection, but I think it's
> >>>> best to get rid of the callback data ASAP or else it will crash.
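The two options weighed above (freeing the entry as soon as the local side requests the disconnect, versus a timer-based sweep) could be sketched roughly as follows. This is only an illustration under stated assumptions, not JNOS code: `circ_entry`, `pending_table`, `release_slot`, `local_disconnect` and `reap_stale` are hypothetical names, and the wait value passed to `reap_stale` is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_SIZE 20 /* hypothetical table size */

enum circ_state { CIRC_CONNECTED, CIRC_DISC_PENDING };

/* Hypothetical per-circuit entry; not the real JNOS structure. */
struct circ_entry {
    enum circ_state state;
    time_t disc_requested_at; /* when the local side asked to disconnect */
};

/* One slot per circuit; NULL means the slot is free. */
static struct circ_entry *pending_table[TABLE_SIZE];

/* Put a new connected circuit in the first free slot; -1 on failure. */
static int add_circuit(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (pending_table[i] == NULL) {
            struct circ_entry *e = malloc(sizeof *e);
            if (e == NULL)
                return -1;
            e->state = CIRC_CONNECTED;
            e->disc_requested_at = (time_t)0;
            pending_table[i] = e;
            return i;
        }
    }
    return -1;
}

/* Option A: call this at the moment of the local disconnect request.
 * The memory is freed and the slot cleared in one step. */
static void release_slot(int slot)
{
    if (slot < 0 || slot >= TABLE_SIZE)
        return;
    free(pending_table[slot]); /* free(NULL) is a no-op */
    pending_table[slot] = NULL;
}

/* Option B, part 1: only mark the entry and remember when. */
static void local_disconnect(int slot, time_t now)
{
    if (slot < 0 || slot >= TABLE_SIZE || pending_table[slot] == NULL)
        return;
    pending_table[slot]->state = CIRC_DISC_PENDING;
    pending_table[slot]->disc_requested_at = now;
}

/* Option B, part 2: periodic sweep. Releases entries whose final
 * disconnect never arrived within max_wait_sec. Returns the count. */
static int reap_stale(time_t now, double max_wait_sec)
{
    int reaped = 0;
    assert(max_wait_sec >= 0.0);
    for (int i = 0; i < TABLE_SIZE; i++) {
        struct circ_entry *e = pending_table[i];
        if (e != NULL && e->state == CIRC_DISC_PENDING &&
            difftime(now, e->disc_requested_at) >= max_wait_sec) {
            release_slot(i);
            reaped++;
        }
    }
    return reaped;
}
```

With option A there is no window in which a freed entry is still reachable from the table. Option B keeps the entry around for late frames, at the cost of needing the sweep to run reliably.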
> >>>>
> >>>> Anyways ...
> >>>>
> >>>> Maiko / VE4KLM
> >>>>
> >>>> On 2022-11-13 10:37 a.m., Maiko (Personal) wrote:
> >>>>> Good morning,
> >>>>>
> >>>>> Slightly technical post ...
> >>>>>
> >>>>> This has been driving me nuts the past few months, it just seems
> >>>>> to have started, perhaps because I took on a new netrom neighbour
> >>>>> or two, I just don't know, but I think I know the reasons for all
> >>>>> the crashes. After a few days of inserting some very heavy debugs
> >>>>> into the code, this is where I am at this morning :
> >>>>>
> >>>>> JNOS keeps a table of netrom callbacks, the default is 20. When a
> >>>>> new connection happens, it gets put into the table, and when it's
> >>>>> done with, it is supposed to be removed from the table. However,
> >>>>> this removal is ONLY DONE when the state of the connection becomes
> >>>>> disconnected. What is happening, is that it appears the entry in
> >>>>> the table for a specific connection looks valid, but in fact it
> >>>>> has disappeared, but JNOS did not remove it, so crash !!!
> >>>>>
> >>>>> What this suggests to me is that I did not get the final NETROM
> >>>>> disconnected, so JNOS still thinks the callback data is valid, but
> >>>>> in fact it is not, the memory has disappeared, so what happens is
> >>>>> you get every few days a crash in the nr4subr.c functions, like :
> >>>>>
> >>>>>    Program received signal SIGSEGV, Segmentation fault.
> >>>>>    0x000000000047fdd9 in match_n4circ (index=23, id=71, user=0x2081457
> >>>>>    "\236\226d\240\212\234b\236\226d\240\212\234b", node=0x208145e
> >>>>>    "\236\226d\240\212\234b") at nr4subr.c:138
> >>>>>     138  if ((int)(cb->yournum) == index && (int)(cb->yourid) == id
> >>>>>
> >>>>> AND
> >>>>>
> >>>>>    Program received signal SIGSEGV, Segmentation fault.
> >>>>>    0x00007f96411f9780 in __memcmp_avx2_movbe () from /lib64/libc.so.6
> >>>>>    (gdb) where
> >>>>>    #0  0x00007f96411f9780 in __memcmp_avx2_movbe () from /lib64/libc.so.6
> >>>>>    #1  0x0000000000482727 in nrresetlinks (rp=0x22c5550) at nr3.c:1441
> >>>>>    #2  0x000000000047ca22 in doobsotick () at nrcmd.c:1316
> >>>>>
> >>>>> It is very consistent, so I am running into cases where I am not
> >>>>> getting the final netrom layer 4 disconnect, so the callback remains,
> >>>>> but JNOS needs to loop through the whole circuit table to find valid
> >>>>> ones to match up with, and this invalid one just happens to still be
> >>>>> in the table and kablewee :]
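The failure described above, a table slot that still points at memory which is no longer valid, can be illustrated with a small sketch. This is not the JNOS source: only the field names `yournum` and `yourid` come from the gdb trace, while `circuit_cb`, `circuit_table`, `open_circuit`, `close_circuit` and `match_circuit` are hypothetical names made up for the example. The point is simply that the slot is cleared in the same step that frees the block, so a later scan of the whole table cannot dereference a stale pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CIRCUITS 20 /* table size of 20, as described in the post */

/* Hypothetical control block. Only the field names yournum and yourid
 * come from the gdb trace; everything else here is invented. */
struct circuit_cb {
    int yournum;
    int yourid;
};

/* One slot per circuit; NULL means the slot is free. */
static struct circuit_cb *circuit_table[MAX_CIRCUITS];

/* Allocate a control block and store it in the first free slot.
 * Returns the slot number, or -1 if the table is full or malloc fails. */
static int open_circuit(int yournum, int yourid)
{
    for (int i = 0; i < MAX_CIRCUITS; i++) {
        if (circuit_table[i] == NULL) {
            struct circuit_cb *cb = malloc(sizeof *cb);
            if (cb == NULL)
                return -1;
            cb->yournum = yournum;
            cb->yourid = yourid;
            circuit_table[i] = cb;
            return i;
        }
    }
    return -1;
}

/* Free the control block AND clear the slot in the same step, so the
 * table never holds a pointer to freed memory. */
static void close_circuit(int slot)
{
    assert(slot >= 0 && slot < MAX_CIRCUITS);
    free(circuit_table[slot]); /* free(NULL) is a no-op */
    circuit_table[slot] = NULL;
}

/* Scan the whole table, skipping empty slots before dereferencing.
 * Returns the matching slot number, or -1 if there is no match. */
static int match_circuit(int yournum, int yourid)
{
    for (int i = 0; i < MAX_CIRCUITS; i++) {
        struct circuit_cb *cb = circuit_table[i];
        if (cb == NULL)
            continue;
        if (cb->yournum == yournum && cb->yourid == yourid)
            return i;
    }
    return -1;
}
```

If the free happened without clearing the slot, `match_circuit` would read `cb->yournum` through a dangling pointer, which is the kind of access that can end in a SIGSEGV like the ones shown above.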
> >>>>>
> >>>>> Anyways, I hope to have a fix of sorts for this 'soon', very
> >>>>> frustrating. But again, why has this suddenly started happening
> >>>>> at the frequency it has for the past 3 months, possibly more ?
> >>>>>
> >>>>> Jack, this is probably what you are experiencing as well.
> >>>>>
> >>>>> Maiko / VE4KLM
> >>>>>
>
> _______________________________________________
> nos-bbs mailing list
> nos-bbs at lists.tapr.org
> http://lists.tapr.org/mailman/listinfo/nos-bbs_lists.tapr.org
>
-- 
Regards,
Don