[nos-bbs] Generic malloc failure -- why?

Tue Sep 11 12:16:52 EDT 2007

Hi Skip,

> Nope - it remains running and generates a string of messages, one per
> line exactly as formatted, I don't know on what frequency...

IF JNOS CONTINUES TO RUN, it can only mean one of the following:

1 User or sysop is trying to dump the ARP table or any NETROM info to
   their display to see what's there, but JNOS isn't able to obtain the
   memory required to do it.

2 Your popmail service is trying to create a control block, but is not
   able to get the memory it needs to do that.

3 If you are using the JNOS ppp interface, it's having trouble getting
   memory from the system.

4 Your smtp service is trying to create a control block, but it is not
   able to get the memory it needs to do that.

5 You are trying to 'trace' a particular JNOS interface, but JNOS is
   not able to allocate memory for that.

I don't know. Is it possible that you are getting inundated with too
many POP or SMTP requests, so many that perhaps memory is getting a bit
too fragmented and your only resort is to restart JNOS ?

How much NETROM activity do you have on your system? How many nodes are
in the table at any particular time? Just thinking out loud. Of course, one
could get carried away I suppose and do something like this:

  If MALLOC_CHECK_ is set to 0, any detected heap corruption
  is silently ignored; if set to 1, a diagnostic is printed
  on stderr; if set to 2, abort() is called immediately.
  This can be useful because otherwise a crash may happen
  much later, and the true cause for the problem is then very
  hard to track down.

In other words, 'setenv MALLOC_CHECK_ 1' (note the trailing underscore;
in bash, 'export MALLOC_CHECK_=1'), then run JNOS in that shell
environment and see if any diagnostics come up.

>>> Thus far I can't tie the message to any particular event.

Hopefully the above 5 points will help you focus on the particular
event that is giving you the messages.

> Actually I'm surprised at fragmented as a reason, I would expect malloc
> to be pretty aggressive at shifting stuff around to get space for a
> request.  Could jnos be asking for an unusually large chunk that needed
> to be in real space and not be swappable?

JNOS is asking linux for memory via malloc() (a C library call, which in
turn gets memory from the kernel). JNOS neither cares about nor even knows
the mechanics behind the call. But this is an interesting thing you ask.
I've read somewhere that perhaps one should not give up immediately on a
failed malloc() request if the system is busy at the time. One could
instead wait (in the JNOS case, do a pwait), then retry the malloc()
several times, and fail only after several attempts, not just the first.
I can play around with that on my development system, it sounds like an
interesting idea ...
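
To make the retry idea concrete, here is a rough sketch only, not
anything that exists in JNOS today. The names malloc_retry, alloc_fn,
flaky_alloc and failures_left are made up for illustration, and
nanosleep() is just a stand-in for the pwait that JNOS would actually
do between attempts. The allocator is passed in as a function pointer
so that a deliberately failing one can be swapped in for testing:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for nanosleep() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Allocator signature, so a failing allocator can be substituted. */
typedef void *(*alloc_fn)(size_t);

/*
 * Hypothetical helper (not part of JNOS): call alloc() up to
 * max_attempts times, pausing delay_ms milliseconds after each
 * failure. Returns NULL only once every attempt has failed.
 */
static void *malloc_retry(alloc_fn alloc, size_t nbytes,
                          int max_attempts, long delay_ms)
{
    assert(alloc != NULL);

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        void *p = alloc(nbytes);
        if (p != NULL)
            return p;

        if (attempt < max_attempts) {
            /* Stand-in for a JNOS pwait: back off before retrying. */
            struct timespec ts;
            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);
        }
    }
    return NULL;
}

/* Demo allocator: fails failures_left times, then defers to malloc(). */
static int failures_left = 0;

static void *flaky_alloc(size_t nbytes)
{
    if (failures_left > 0) {
        failures_left--;
        return NULL;
    }
    return malloc(nbytes);
}
```

In real use one would call something like malloc_retry(malloc, size,
3, 50) wherever the control block is created, and log the generic
malloc failure only if that still comes back NULL. Whether waiting
actually helps depends on why the request failed in the first place,
so this is an experiment more than a fix.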

Maiko Langelaar / VE4KLM