[ts-7000] Re: TS-7500 Memory Loss

To:
Subject:	[ts-7000] Re: TS-7500 Memory Loss
From:	"al" <>
Date:	Tue, 11 Sep 2012 18:39:31 -0000

We have completed our testing and have shown conclusively that there is a bug 
in Linux that creates what appears to be a memory leak or loss.  The bug is not 
in the Ethernet stack as we had suspected, but in a particular section of the 
Linux memory manager.

We created a generic process with two threads, one for a basic client and the 
other for a basic server.  We also created two Windows applications to interact 
with the threads and communicate messages at a high rate so that failure would 
not take long to recreate (typically less than 3 hours on a TS-7500).  We used 
a memory logger that runs from a read-write partition on the TS-7500's SD card 
and records memory statistics to that card every 5 minutes.
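
A logger of this kind can be quite small.  The following is only a sketch of 
one way to do it, not the logger we actually ran: the function names, the CSV 
layout and the log path are made up for illustration, and it assumes awk and a 
date that understands +%s are available on the board.

```shell
#!/bin/sh
# Sketch of a periodic memory logger (names and paths are illustrative).

# Print the value in kB of one /proc/meminfo field, e.g. "MemFree".
# $1 = field name, $2 = meminfo file (defaults to /proc/meminfo).
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/meminfo}"
}

# Append one CSV sample: epoch seconds, MemFree, Cached, Slab, SUnreclaim (kB).
# $1 = log file, $2 = meminfo file (optional, defaults to /proc/meminfo).
log_mem_sample() {
    printf '%s,%s,%s,%s,%s\n' \
        "$(date +%s)" \
        "$(meminfo_kb MemFree "$2")" \
        "$(meminfo_kb Cached "$2")" \
        "$(meminfo_kb Slab "$2")" \
        "$(meminfo_kb SUnreclaim "$2")" >> "$1"
}

# Example loop, one sample every 5 minutes (hypothetical log path):
# while :; do log_mem_sample /mnt/sd/memlog.csv; sleep 300; done
```

Logging SUnreclaim next to MemFree and Cached makes it easier to tell cache 
growth, which the kernel can give back, from slab growth, which it may not.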

Our TS-7500 crashed after about 2 hours and 45 minutes, at which point it did 
not respond to ANY form of communication, including the serial console.  Our 
memory logger showed a rapid decrease in free memory and a rapid increase in 
used and slab memory.  The graph of memory stats showed the system releasing a 
small amount of memory when free memory dropped below about 1.5 MB, after which 
free memory continued to decrease at the same rate.  It did this twice, but 
when free memory hit about 1.2 MB, the last entry in the log, the system hung.

After reading up on Linux memory management, we confirmed that Linux is 
designed to hoard memory: otherwise-free pages are kept as cache, and kernel 
objects are allocated from page-backed 'slab' caches.  The network stack is not 
designed to work from a pre-allocated memory area, as many embedded operating 
systems are.  It allocates kernel memory dynamically for all socket 
descriptors, file descriptors, data buffers, etc., which is why memory 
fragments so quickly.  From what we saw, the memory allocated is NOT returned 
to the free pool when a socket is closed or a data packet has been processed, 
which is why free memory decreases so rapidly.  But the worst design is in 
kswapd, the kernel thread that reclaims memory when free memory runs low.  As 
slabs are marked as full (fully allocated), they are placed in a non-cached 
list of slabs that are released only partially when free memory drops below a 
certain limit (~1.2 MB in our experience).  It is a very stingy routine and 
does not release enough memory to prevent the system from killing off tasks or 
crashing.  That is the root cause of the memory 'loss': the hoarding and 
mismanagement of memory by kswapd.
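
For anyone who wants to see which slab caches are doing the growing, one 
option is to rank the entries in /proc/slabinfo by approximate size.  This is 
a sketch only: it assumes /proc/slabinfo is present and readable (normally as 
root) on the target kernel, the function name is made up, and objects times 
object size ignores per-slab overhead.

```shell
#!/bin/sh
# Print the N largest slab caches as "<approx kB> <cache name>", largest first.
# $1 = how many caches to show (default 5)
# $2 = slabinfo file (defaults to /proc/slabinfo)
top_slab_caches() {
    # Skip the two header lines; column 3 is <num_objs>, column 4 is <objsize>.
    awk '/^slabinfo/ || /^#/ { next }
         { printf "%d %s\n", $3 * $4 / 1024, $1 }' "${2:-/proc/slabinfo}" |
        sort -rn | head -n "${1:-5}"
}

# Example: show the ten largest caches on the running system.
# top_slab_caches 10
```

Running it a few times during a test and comparing the output should show 
whether the growth is in network-related caches or somewhere else.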

The good news is that we found an arcane command sequence that forces the 
kernel to release its cached memory (writing 3 to drop_caches drops the page 
cache plus reclaimable dentries and inodes).  This allowed us to run a 
background task that periodically issues the command sequence, forcing the 
release of memory fast enough that it did not contribute to filling slabs.  
Once a slab is filled, the command sequence cannot free it, since the sequence 
applies to cached memory only.  The trick is to release cached memory at least 
as fast as it is consumed.  Using this method, a recent test showed ZERO loss 
of memory over a 15-hour period.

The background task runs this command sequence every 5 minutes (as root):

sync; echo 3 > /proc/sys/vm/drop_caches
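
Wrapped up as a background task, the sequence above might look like the 
following.  This is a sketch under assumptions rather than our exact task: the 
function names are invented, and the optional file argument exists only so the 
function can be tried without root.

```shell
#!/bin/sh
# Sketch of the periodic cache-release task (must run as root on the target).

# Flush dirty pages to storage, then ask the kernel to drop clean caches.
# $1 = control file (defaults to /proc/sys/vm/drop_caches); the override is
# an illustration aid so the function can be exercised on a scratch file.
drop_caches_once() {
    sync
    echo 3 > "${1:-/proc/sys/vm/drop_caches}"
}

# Call drop_caches_once every $1 seconds (default 300 = 5 minutes), forever.
drop_caches_loop() {
    while :; do
        drop_caches_once || echo "drop_caches write failed" >&2
        sleep "${1:-300}"
    done
}

# Start it in the background, e.g. from an init script:
# drop_caches_loop 300 &
```

The sync comes first because drop_caches only discards clean pages; dirty 
pages have to be written out before they can be released.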

This should help Paul "ptreos2" work around the problem he is having.

If I can figure out how to post pictures to this group, I will post graphs of 
memory usage over days of typical operation before and after the workaround.

Mitch

--- In  "al" <> wrote:
>
> Thank you, Paul!  I have started a new round of testing that includes 
> periodic sampling of output from the 'free' command and the SUnreclaim line 
> from /proc/meminfo.
> 
> The test I just finished showed that Linux grabs all but the last ~1.3 MB of 
> available memory at which point it frees about 1.1 MB and continues running, 
> using memory in blocks of 12 KB.  This just repeats over and over.  When our 
> application failed, there was still plenty of memory left.
> 
> So, memory leaks do not appear to be causing our problem.  Nor does an errant 
> pointer - if the application had strayed out of bounds, we would have seen a 
> segmentation fault message.  In fact, we see no messages at all, a sign the 
> application was killed via a signal from the kernel.
> 
> I tried compiling with debug information and starting the application via 
> GDB, but the system was so overloaded that it failed within seconds of 
> starting.  The sbuslock() assert("r == 0") call was made, showing a 
> semop(SEM_UNDO) failure.  Without GDB, the assert is never seen, so my next 
> test will be without it.
> 
> I will notify the group of any other findings I stumble upon.  ;-)
> 
> Regards, Mitch
> 
> 
> 
> --- In  "ptreos2" <ptre@> wrote:
> >
> > 
> > We see this same problem on the same HW and Linux kernel. The only network 
> > activity we have is for a web server. If no browser points to the web 
> > server, we stay up forever. Put a browser on it that continuously updates 
> > pages, and we run out of free memory within 20 hours.
> > 
> > Look at /proc/meminfo and you will see SUnreclaim steadily increasing. 
> > This is the kernel leaking memory. However it doesn't happen all the time. 
> > It seems one must first see an nbd error in syslog. The error I see looks 
> > like this:
> > 
> > Jul 17 14:22:14 ts7500 kernel: [ 6061.930000] nbd1: Other side returned error (1)
> > Jul 17 14:22:14 ts7500 kernel: [ 6061.930000] end_request: I/O error, dev nbd1, sector 1073741696
> > Jul 17 14:22:14 ts7500 kernel: [ 6061.930000] Buffer I/O error on device nbd1, logical block 134217712
> > 
> > Only after this error occurs do I then see the kernel memory leak.
> > I am talking to a tech rep about this issue, but it is going to take some 
> > time to resolve. We may need to go to a new kernel, although I would much 
> > rather have a field-deployable fix.
> > 
> > Paul
> > 
> > 
> > --- In  "al" <mitch.stanek@> wrote:
> > >
> > > We are using a TS-7500 for a network-intensive application that sends and 
> > > receives multiple messages (TCP) per second.  The application does no 
> > > run-time memory allocation; all allocation is done at power-up.  The 
> > > application is supposed to run continuously for months, but our tests 
> > > show it stops running after about 4 days and has a memory leakage of 
> > > about 2 MB/day!  The TS-7500 has Linux 2.6.24.4.
> > > 
> > > When the application starts, 'free' reports 1.4 MB free + 15 MB 
> > > cache/buffer.  Just before the application is killed by Linux (4.5 days 
> > > later), the free memory is 1.3 MB + 7.2 MB cache/buffer.
> > > 
> > > It appears that the Linux network stack has some serious memory leaks.
> > > 
> > > We are using the fast-boot option, running the Busybox Linux that comes 
> > > in the Flash memory.  We tried the slow-boot option using the Linux 
> > > 2.6.24 kernel, but got the same results, so it seems endemic to the Linux 
> > > network stack.  To ensure the application is not leaking memory, we will 
> > > run Valgrind.
> > > 
> > > In the meantime, has anyone in this group encountered this memory 
> > > leakage in network-intensive applications?  If so, were you able to fix 
> > > it?
> > > 
> > > Kind regards,
> > > Mitch
> > >
> >
>
