[nos-bbs] crashes (netrom) have mysteriously stopped

Fri Dec 9 11:10:32 EST 2022

Good day,

Shaking my head I am, crashes have mysteriously disappeared, and
so far my uptime has been 11 days and counting. The crash dumps
tell me where the crash occurs, but it makes no sense.

Perhaps it was a link to another system that came and went over the
past few months? Perhaps malformed netrom packets, or netrom
code not dealing properly with a netrom feature that is rarely
used (IP over NETROM for example). It could be stack related. This
one is a doozy, welcome to the world of software development.

So now when you want a crash, you don't get one ...

Maiko / VE4KLM

On 2022-11-19 11:23, Maiko (Personal) wrote:
> Nothing like jinxing yourself, forget it, it gets fixed when
> it gets fixed. No idea now as to why, could be newer compiler
> for all I know (I did upgrade my OS a while ago), so perhaps
> it's exposing something in the JNOS code, sigh, not the first
> time it's happened.
> 
> Last post on this, sorry for filling up your mailboxes :]
> 
> M
> 
> On 2022-11-18 1:42 p.m., maiko at pcsinternet.ca wrote:
>> Interestingly enough the crashing 'seems' to have stopped.
>> 
>> All of this started a while ago after I added a new wormhole
>> to another system (FBB over BPQ netrom), but that system seems
>> to have suddenly disappeared from the ether. I think they are
>> having amprnet connectivity issues, so this will be much more
>> difficult to track down now as I don't have a source to figure
>> this out with. I am trying to track down the version of BPQ,
>> hoping it will help me figure out what to do on my end.
>> 
>> There is something about the netrom traffic or states that is
>> causing JNOS to crash in the NR4 level code, but I have yet to
>> figure it out ... it's very confusing what is going on ...
>> 
>> Maiko / VE4KLM
>> 
>> On 2022-11-13 11:12, Maiko (Personal) wrote:
>>> Okay, last one for now, and learning as I go ...
>>> 
>>> Perhaps I need to set the NR4TDISC a lot lower (default) ?
>>> 
>>>    jnos> netrom tdisc
>>>    NR4 redundancy timer (sec): 120
>>> 
>>> Experiences anyone ? But still, even with a smaller timeout value,
>>> there is a 'risk' of a crash, making me think the current way of
>>> doing a circuit table lookup and reusing entries, seems not to be
>>> the brightest way of doing it ? thinking a 'rewrite', ugh, no.
>>> 
>>> Maiko / VE4KLM
>>> 
>>> On 2022-11-13 10:47 a.m., Maiko (Personal) wrote:
>>>> I am guessing (hopefully this shows up in my debugs) ...
>>>> 
>>>> IF the local side requests a netrom layer 4 disconnect, then JNOS
>>>> should probably free the callback there and then, instead of waiting
>>>> for the final disconnect (which may not get to us). I figure it would
>>>> not hurt to remove it at that point, since effectively it is done.
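A minimal sketch of the "free it there and then" idea, not the JNOS code. Every name here (circ_table, circ_new, circ_local_disconnect, circ_remote_disc_ack, send_disc_request) is hypothetical, and the control block is reduced to a placeholder. The point it shows: the slot is released as soon as the local side asks to disconnect, so a final disconnect that never arrives cannot leave a stale entry behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CIRC 20                 /* assumed table size */

/* Hypothetical control block, reduced to a placeholder field. */
struct circ_cb {
    int in_use;
};

struct circ_cb *circ_table[MAX_CIRC];   /* NULL marks an unused slot */
int disc_requests_sent;                 /* counts requests "sent" by the stub */

/* Stand-in for transmitting a layer 4 disconnect request. */
static void send_disc_request(void)
{
    disc_requests_sent++;
}

/* Put a new circuit in the first free slot; returns the slot or -1. */
int circ_new(void)
{
    for (int i = 0; i < MAX_CIRC; i++) {
        if (circ_table[i] == NULL) {
            circ_table[i] = calloc(1, sizeof *circ_table[i]);
            if (circ_table[i] == NULL)
                return -1;
            circ_table[i]->in_use = 1;
            return i;
        }
    }
    return -1;
}

/* Local side asks to disconnect: send the request, then release the
 * slot immediately instead of waiting for the remote side to confirm. */
void circ_local_disconnect(int slot)
{
    assert(slot >= 0 && slot < MAX_CIRC);
    if (circ_table[slot] == NULL)
        return;                     /* already released, nothing to do */
    send_disc_request();
    free(circ_table[slot]);
    circ_table[slot] = NULL;        /* later table scans will skip this slot */
}

/* A disconnect confirmation arrives from the remote side. Returns 1 if
 * it released a live circuit, 0 if the circuit was already gone. */
int circ_remote_disc_ack(int slot)
{
    if (slot < 0 || slot >= MAX_CIRC || circ_table[slot] == NULL)
        return 0;
    free(circ_table[slot]);
    circ_table[slot] = NULL;
    return 1;
}
```

One trade-off of this design: once the control block is gone, there is no state left to retransmit the disconnect request if it is lost on the way out.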
>>>> 
>>>> I could put in a timer based garbage collection, but I think it's
>>>> best to get rid of the callback data ASAP or else it will crash.
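For comparison, a timer based garbage collection could look roughly like the sketch below. It is an illustration under assumptions, not JNOS code: circ_table, circ_new, circ_touch and circ_sweep are hypothetical names, and the idle limit is just a parameter (120 seconds in the test, chosen only as an example value).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define MAX_CIRC 20                 /* assumed table size */

/* Hypothetical control block: only the timestamp the sweep needs. */
struct circ_cb {
    time_t last_activity;
};

struct circ_cb *circ_table[MAX_CIRC];   /* NULL marks an unused slot */

/* Put a new circuit in the first free slot; returns the slot or -1. */
int circ_new(time_t now)
{
    for (int i = 0; i < MAX_CIRC; i++) {
        if (circ_table[i] == NULL) {
            circ_table[i] = calloc(1, sizeof *circ_table[i]);
            if (circ_table[i] == NULL)
                return -1;
            circ_table[i]->last_activity = now;
            return i;
        }
    }
    return -1;
}

/* Record traffic on a live circuit. */
void circ_touch(int slot, time_t now)
{
    assert(slot >= 0 && slot < MAX_CIRC);
    if (circ_table[slot] != NULL)
        circ_table[slot]->last_activity = now;
}

/* Release every circuit idle for more than max_idle seconds.
 * Returns the number of slots released. */
int circ_sweep(time_t now, double max_idle)
{
    int released = 0;
    for (int i = 0; i < MAX_CIRC; i++) {
        struct circ_cb *cb = circ_table[i];
        if (cb == NULL)
            continue;
        if (difftime(now, cb->last_activity) > max_idle) {
            free(cb);
            circ_table[i] = NULL;   /* clear the slot together with the free */
            released++;
        }
    }
    return released;
}
```

A sweep like this only bounds how long a dead entry can linger, so on its own it narrows the window rather than closing it.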
>>>> 
>>>> Anyways ...
>>>> 
>>>> Maiko / VE4KLM
>>>> 
>>>> On 2022-11-13 10:37 a.m., Maiko (Personal) wrote:
>>>>> Good morning,
>>>>> 
>>>>> Slightly technical post ...
>>>>> 
>>>>> This has been driving me nuts the past few months, it just seems
>>>>> to have started, perhaps because I took on a new netrom neighbour
>>>>> or two, I just don't know, but I think I know the reasons for all
>>>>> the crashes. After a few days of inserting some very heavy debugs
>>>>> into the code, this is where I am at this morning :
>>>>> 
>>>>> JNOS keeps a table of netrom callbacks, the default is 20. When a
>>>>> new connection happens, it gets put into the table, and when it's
>>>>> done with, it is supposed to be removed from the table. However,
>>>>> this removal is ONLY DONE when the state of the connection becomes
>>>>> disconnected. What is happening, is that it appears the entry in
>>>>> the table for a specific connection looks valid, but in fact it
>>>>> has disappeared, but JNOS did not remove it, so crash !!!
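To illustrate the table described above, here is a small sketch. It is not the JNOS source: the names (circ_table, circ_open, circ_close, circ_match) and the reduced control block are assumptions, loosely modelled on the arguments visible in the match_n4circ backtrace. It shows the invariant the lookup relies on: a slot is either NULL or points at live memory, which only holds if freeing the block and clearing the slot always happen together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CIRC 20     /* assumed table size, matching the default of 20 */
#define CALL_LEN 7      /* assumed length of an encoded callsign field */

/* Hypothetical control block; the real one carries far more state. */
struct circ_cb {
    int yournum;                    /* remote circuit index */
    int yourid;                     /* remote circuit id */
    unsigned char user[CALL_LEN];
    unsigned char node[CALL_LEN];
};

struct circ_cb *circ_table[MAX_CIRC];   /* NULL marks an unused slot */

/* Create a control block in the first free slot; returns the slot or -1. */
int circ_open(int yournum, int yourid,
              const unsigned char *user, const unsigned char *node)
{
    for (int i = 0; i < MAX_CIRC; i++) {
        if (circ_table[i] == NULL) {
            struct circ_cb *cb = calloc(1, sizeof *cb);
            if (cb == NULL)
                return -1;
            cb->yournum = yournum;
            cb->yourid = yourid;
            memcpy(cb->user, user, CALL_LEN);
            memcpy(cb->node, node, CALL_LEN);
            circ_table[i] = cb;
            return i;
        }
    }
    return -1;
}

/* Free the block and clear the slot in one step, so that no later scan
 * can follow a dangling pointer. Safe to call twice on the same slot. */
void circ_close(int slot)
{
    assert(slot >= 0 && slot < MAX_CIRC);
    free(circ_table[slot]);         /* free(NULL) is a no-op */
    circ_table[slot] = NULL;
}

/* Scan the whole table for a matching circuit, skipping unused slots. */
struct circ_cb *circ_match(int index, int id,
                           const unsigned char *user,
                           const unsigned char *node)
{
    for (int i = 0; i < MAX_CIRC; i++) {
        struct circ_cb *cb = circ_table[i];
        if (cb == NULL)
            continue;
        if (cb->yournum == index && cb->yourid == id &&
            memcmp(cb->user, user, CALL_LEN) == 0 &&
            memcmp(cb->node, node, CALL_LEN) == 0)
            return cb;
    }
    return NULL;
}
```

If the memory behind a slot goes away while the slot still holds the old pointer, the scan in circ_match dereferences freed memory, which is the kind of failure a SIGSEGV inside a lookup loop would be consistent with.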
>>>>> 
>>>>> What this suggests to me is that I did not get the final NETROM
>>>>> disconnected, so JNOS still thinks the callback data is valid, but
>>>>> in fact it is not, the memory has disappeared, so what happens is
>>>>> you get every few days a crash in the nr4subr.c functions, like :
>>>>> 
>>>>>    Program received signal SIGSEGV, Segmentation fault.
>>>>>    0x000000000047fdd9 in match_n4circ (index=23, id=71, user=0x2081457
>>>>>    "\236\226d\240\212\234b\236\226d\240\212\234b", node=0x208145e
>>>>>    "\236\226d\240\212\234b") at nr4subr.c:138
>>>>>     138  if ((int)(cb->yournum) == index && (int)(cb->yourid) == id
>>>>> 
>>>>> AND
>>>>> 
>>>>>    Program received signal SIGSEGV, Segmentation fault.
>>>>>    0x00007f96411f9780 in __memcmp_avx2_movbe () from /lib64/libc.so.6
>>>>>    (gdb) where
>>>>>    #0  0x00007f96411f9780 in __memcmp_avx2_movbe () from /lib64/libc.so.6
>>>>>    #1  0x0000000000482727 in nrresetlinks (rp=0x22c5550) at nr3.c:1441
>>>>>    #2  0x000000000047ca22 in doobsotick () at nrcmd.c:1316
>>>>> 
>>>>> It is very consistent, so I am running into cases where I am not
>>>>> getting the final netrom layer 4 disconnect, so the callback remains,
>>>>> but JNOS needs to loop through the whole circuit table to find valid
>>>>> ones to match up with, and this invalid one just happens to still be
>>>>> in the table and kablewee :]
>>>>> 
>>>>> Anyways, I hope to have a fix of sorts for this 'soon', very 
>>>>> frustrating. But again, why has this suddenly started happening
>>>>> at the frequency it has for the past 3 months, possibly more ?
>>>>> 
>>>>> Jack, this is probably what you are experiencing as well.
>>>>> 
>>>>> Maiko / VE4KLM
>>>>>