[nos-bbs] JNOS 2.0k memory issues 1: overgrown mailboxes

Mon Aug 15 13:26:40 EDT 2016

Memory Issue 1:  JNOS using massive (>1G) amount of memory = overgrown
mailboxes:

The JNOS hangs and use of massive (>1GB) amounts of memory turned out to be
a problem with JNOS corrupting/overgrowing (for lack of a better word) one
or more mailboxes.  On the machines exhibiting the hangs, I discovered that
one or more JNOS mailboxes had grown to 100s of MB or even > 1GB.  When a
forwarding session occurred on machines in this condition, the temp files
used by JNOS (/tmp/fileXXXXXX) would also grow to 100s of MB or > 1GB, free
memory would drop to almost nothing, and swap space utilization would start
to creep up.  In all cases that I've seen, the mailboxes were bulletin
mailboxes.  To relieve the symptom, I deleted the mailbox(es)
(mailbox-name.*) and deleted the temp files (/tmp/file*).   After that, JNOS
would behave normally.

Some may recall that I reported this several months ago.  The problem
mailboxes at that time contained multiple copies of each message, making
them enormously large.  The problem did not happen again until this past
week, which happened to be when I updated to 2.0k.  Because I suspected
2.0k, I backed our production machines off to 2.0j.7v.  But it happened
again after that.  Therefore, this is NOT something new in 2.0k.

I don't have any hard evidence because I don't know how to replicate the
problem.  But one suspect is the "expire future" function.  If I remember
correctly, the first occurrence of this problem was when Maiko first updated
expire future to deal with different time zones.  Perhaps something in the
way the indexes were updated created a side-effect?  This is pure
speculation.  But it's based on the following sequence of events.  

*  I had expire future turned off because it was ignoring the timezone and
"future"-ing legitimate messages

*  Maiko did some work on expire future to deal with different timezones

*  I turned on expire future

*  I *THINK* that the first occurrence of the problem happened after that
(but I don't recall exactly and I didn't correlate the two issues at the
time)

*  I turned off expire future because the first fix didn't quite work

*  I didn't experience any problems for months

*  Maiko fixed expire future in 2.0k

*  When I updated to 2.0k, I also turned on expire future

*  When I experienced the hangs, I backed off to 2.0j.7v, but left expire
future turned on

*  I had another mailbox overgrowth with 2.0j.7v

Workaround:

I created a shell script that checks the JNOS mailbox sizes and sends mail
if any mailbox is larger than a specified size (a few MB).  It is run by
cron every hour.  I'm hoping this will allow me to catch an overgrown
mailbox before it gets too large.
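
For anyone who wants to do something similar, here is a minimal sketch of
that kind of size check.  It is written in Python rather than shell, and it
is not the script I run.  The function name, the example directory and the
5 MB limit are all made up for illustration; point it at wherever your JNOS
mailboxes actually live.

```python
import os


def oversized_mailboxes(mail_dir, limit_bytes):
    """Return (path, size) pairs for files in mail_dir larger than limit_bytes.

    mail_dir and limit_bytes are hypothetical parameters; nothing here is
    read from the JNOS configuration.
    """
    hits = []
    for entry in os.scandir(mail_dir):
        # Only look at regular files; skip subdirectories and symlinks.
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            if size > limit_bytes:
                hits.append((entry.path, size))
    # Largest first, so the worst offender is at the top of the report.
    return sorted(hits, key=lambda item: item[1], reverse=True)


def format_report(hits):
    """Turn the list of oversized files into one line of text per file."""
    return "\n".join(f"{size:>12d}  {path}" for path, size in hits)
```

Calling something like print(format_report(oversized_mailboxes("/path/to/mail",
5 * 1024 * 1024))) from an hourly cron job, and printing only when the list
is non-empty, would be one way to get the alert: most cron setups mail any
output of a job to its owner.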

Suggestions:

1)      As previously requested, change the name of the temp files used by
jnos to show that they belong to jnos.  Something like "/tmp/jnos-XXXXXX"
would be more appropriate than the current /tmp/fileXXXXXX.  This would at
least let us know where the problem file is coming from.

2)      Perhaps JNOS could be enhanced with configurable warning and error
levels of max mailbox size.  Upon reaching the warning size, it would log a
warning.  Upon reaching the max error size, it would log an error and not
add any more to the mailbox.  This wouldn't fix the problem, but it would
prevent the problem from becoming a hang.  

3)      Tracking the root cause will be difficult without knowing how to
reproduce the problem.  Perhaps some added diagnostic logging is in order?
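
To make suggestion 2 a bit more concrete, the check could look roughly like
the sketch below.  It is in Python only to show the logic; JNOS itself is
written in C, and the function name, the parameter names and the three
return values are invented here, not anything that exists in JNOS today.

```python
def classify_append(current_size, incoming_size, warn_bytes, max_bytes):
    """Decide what to do before appending a message to a mailbox.

    Hypothetical helper: warn_bytes and max_bytes stand in for the
    configurable warning and error levels proposed above.
    """
    new_size = current_size + incoming_size
    if new_size >= max_bytes:
        # Error level reached: log an error and do not add the message.
        return "reject"
    if new_size >= warn_bytes:
        # Warning level reached: log a warning but still add the message.
        return "warn"
    return "ok"
```

With a warning level of 5 MB and a maximum of 20 MB, for example, a runaway
mailbox would stop growing at 20 MB instead of reaching hundreds of MB.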

Michael

N6MEF
