[nos-bbs] Analysis of recent frequent netrom related crashes

maiko at pcsinternet.ca
Fri Nov 18 14:42:59 EST 2022


Interestingly enough, the crashing 'seems' to have stopped.

All of this started a while ago after I added a new wormhole
to another system (FBB over BPQ netrom), but that system seems
to have suddenly disappeared from the ether. I think they are
having amprnet connectivity issues, so this will be much more
difficult to track down now that I no longer have a live source to
test against. I am trying to find out which version of BPQ they run,
hoping that will help me figure out what to do on my end.

Something about the netrom traffic or connection states is causing
JNOS to crash in the NR4 level code, but I have yet to figure out
exactly what ... it is very confusing ...

Maiko / VE4KLM

On 2022-11-13 11:12, Maiko (Personal) wrote:
> Okay, last one for now, and learning as I go ...
> 
> Perhaps I need to set the NR4TDISC a lot lower than it is now (the default) ?
> 
>    jnos> netrom tdisc
>    NR4 redundancy timer (sec): 120
> 
> Any experiences ? But still, even with a smaller timeout value,
> there is a 'risk' of a crash, which makes me think the current way
> of doing a circuit table lookup and reusing entries is not the
> brightest way of doing it ? thinking a 'rewrite', ugh, no.
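> 
> If I do go the smaller timeout route, I believe the same command takes
> the new value in seconds (going from memory of the NOS command syntax
> here, so treat it as an assumption), e.g. in autoexec.nos :
> 
>    netrom tdisc 60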
> 
> Maiko / VE4KLM
> 
> On 2022-11-13 10:47 a.m., Maiko (Personal) wrote:
>> I am guessing (hopefully this shows up in my debugs) ...
>> 
>> IF the local side requests a netrom layer 4 disconnect, then JNOS
>> should probably free the callback there and then, instead of waiting
>> for the final disconnect (which may not get to us). I figure it would
>> not hurt to remove it at that point, since effectively it is done.
>> 
>> I could put in timer-based garbage collection, but I think it's
>> best to get rid of the callback data ASAP, or else it will crash.
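>> 
>> Roughly what I have in mind, just a sketch (free_n4circ() is the
>> routine I remember doing the circuit cleanup in nr4subr.c, so treat
>> the name and placement as assumptions, not the actual patch) :
>> 
>>    /* in the routine handling a LOCALLY initiated layer 4        */
>>    /* disconnect : once our DISC REQUEST has gone out, free the  */
>>    /* callback right away instead of waiting for a reply that    */
>>    /* may never arrive                                           */
>> 
>>    free_n4circ (cb);    /* release the circuit table entry now */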
>> 
>> Anyways ...
>> 
>> Maiko / VE4KLM
>> 
>> On 2022-11-13 10:37 a.m., Maiko (Personal) wrote:
>>> Good morning,
>>> 
>>> Slightly technical post ...
>>> 
>>> This has been driving me nuts for the past few months. It just
>>> seems to have started, perhaps because I took on a new netrom
>>> neighbour or two, I just don't know, but I think I now know the
>>> reason for all the crashes. After a few days of inserting some
>>> very heavy debugs into the code, this is where I am at this
>>> morning :
>>> 
>>> JNOS keeps a table of netrom callbacks, the default size is 20.
>>> When a new connection happens, it gets put into the table, and
>>> when it is done with, it is supposed to be removed from the table.
>>> However, this removal is ONLY DONE when the state of the connection
>>> becomes disconnected. What is happening is that the entry in the
>>> table for a specific connection still looks valid, but the callback
>>> it points to is gone; JNOS never removed the entry, so crash !!!
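>>> 
>>> To illustrate the failure mode, here is a tiny standalone example,
>>> NOT the actual JNOS code (the names are made up), it just shows the
>>> dangling pointer pattern I think I am hitting :
>>> 
>>>    #include <stdlib.h>
>>> 
>>>    #define MAXCIRC 20                    /* like the default of 20 */
>>> 
>>>    struct cb { int yournum, yourid; };
>>>    static struct cb *circuits[MAXCIRC];  /* the circuit table      */
>>> 
>>>    int main (void)
>>>    {
>>>        circuits[3] = malloc (sizeof (struct cb));  /* new circuit  */
>>>        circuits[3]->yournum = 23;
>>>        circuits[3]->yourid  = 71;
>>> 
>>>        free (circuits[3]);   /* callback memory goes away, but the */
>>>                              /* table slot was never set to NULL   */
>>> 
>>>        /* a later table scan still sees a non-NULL slot and reads  */
>>>        /* freed memory; in this toy it is just stale data, but in  */
>>>        /* JNOS, where that memory gets reused, it is the SIGSEGV   */
>>>        for (int i = 0; i < MAXCIRC; i++)
>>>            if (circuits[i] != NULL && circuits[i]->yournum == 23)
>>>                return 1;
>>> 
>>>        return 0;
>>>    }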
>>> 
>>> What this suggests to me is that I did not get the final NETROM
>>> disconnect, so JNOS still thinks the callback data is valid, but
>>> in fact it is not, the memory is gone, so every few days you get
>>> a crash in the nr4subr.c functions, like :
>>> 
>>>    Program received signal SIGSEGV, Segmentation fault.
>>>    0x000000000047fdd9 in match_n4circ (index=23, id=71, user=0x2081457
>>>    "\236\226d\240\212\234b\236\226d\240\212\234b", node=0x208145e
>>>    "\236\226d\240\212\234b") at nr4subr.c:138
>>>     138  if ((int)(cb->yournum) == index && (int)(cb->yourid) == id
>>> 
>>> AND
>>> 
>>>    Program received signal SIGSEGV, Segmentation fault.
>>>    0x00007f96411f9780 in __memcmp_avx2_movbe () from /lib64/libc.so.6
>>>    (gdb) where
>>>    #0  0x00007f96411f9780 in __memcmp_avx2_movbe () from /lib64/libc.so.6
>>>    #1  0x0000000000482727 in nrresetlinks (rp=0x22c5550) at nr3.c:1441
>>>    #2  0x000000000047ca22 in doobsotick () at nrcmd.c:1316
>>> 
>>> It is very consistent, so I am running into cases where I am not
>>> getting the final netrom layer 4 disconnect, so the callback remains,
>>> but JNOS needs to loop through the whole circuit table to find valid
>>> ones to match up with, and this invalid one just happens to still be
>>> in the table and kablewee :]
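>>> 
>>> For context, that lookup loop looks roughly like this, paraphrased
>>> from memory so the table and field names are approximate (only
>>> yournum/yourid are straight out of the backtrace above). Note the
>>> empty-slot check only protects against slots that were properly
>>> cleared, not against a slot whose callback memory is already gone :
>>> 
>>>    /* roughly what match_n4circ() in nr4subr.c does - paraphrased */
>>>    struct nr4cb *
>>>    match_n4circ (int index, int id)
>>>    {
>>>        int i;
>>>        struct nr4cb *cb;
>>> 
>>>        for (i = 0; i < NR4MAXCIRC; i++) {
>>>            if ((cb = Nr4circuits[i].ccb) == NULLNR4CB)
>>>                continue;             /* properly cleared slot, skip */
>>>            if ((int)(cb->yournum) == index && (int)(cb->yourid) == id)
>>>                return cb;            /* boom here if cb is stale    */
>>>        }
>>>        return NULLNR4CB;
>>>    }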
>>> 
>>> Anyways, I hope to have a fix of sorts for this 'soon', very 
>>> frustrating. But again, why has this suddenly started happening
>>> at the frequency it has for the past 3 months, possibly more ?
>>> 
>>> Jack, this is probably what you are experiencing as well.
>>> 
>>> Maiko / VE4KLM
>>> 


