Issue Details (XML | Word | Printable)

Key: LUCENE-1383
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Michael McCandless
Reporter: Michael McCandless
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Work around ThreadLocal's "leak"

Created: 11/Sep/08 08:50 PM   Updated: 11/Oct/08 12:49 PM
Return to search
Component/s: Index
Affects Version/s: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1383.patch 2008-09-11 09:04 PM Michael McCandless 7 kB
Image Attachments:

1. ScreenHunter_01 Sep. 13 08.40.jpg
(15 kB)

2. ScreenHunter_02 Sep. 13 08.42.jpg
(16 kB)

3. ScreenHunter_03 Sep. 13 08.43.jpg
(16 kB)

4. ScreenHunter_07 Sep. 13 19.13.jpg
(17 kB)

Lucene Fields: New
Resolution Date: 14/Sep/08 10:33 AM


 Description  « Hide
Java's ThreadLocal is dangerous to use because it is able to take a
surprisingly very long time to release references to the values you
store in it. Even when a ThreadLocal instance itself is GC'd, hard
references to the values you had stored in it are easily kept for
quite some time later.

While this is not technically a "memory leak", because eventually
(when the underlying Map that stores the values cleans up its "stale"
references) the hard reference will be cleared, and GC can proceed,
its end behavior is not different from a memory leak in that under the
right situation you can easily tie up far more memory than you'd
expect, and then hit unexpected OOM error despite allocating an
extremely large heap to your JVM.

Lucene users have hit this many times. Here's the most recent thread:

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200809.mbox/%3C6e3ae6310809091157j7a9fe46bxcc31f6e63305fcdc%40mail.gmail.com%3E

And here's another:

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/%3CF5FC94B2-E5C7-40C0-8B73-E12245B91CEE%40mikemccandless.com%3E

And then there's LUCENE-436 and LUCENE-529 at least.

A google search for "ThreadLocal leak" yields many compelling hits.

Sun does this for performance reasons, but I think it's a terrible
trap and we should work around it with Lucene.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Michael McCandless added a comment - 11/Sep/08 09:04 PM
Attached patch. All tests pass.

The patch adds o.a.l.util.CloseableThreadLocal. It's a wrapper around ThreadLocal that wraps the values inside a WeakReference, but then also holds a strong reference to the value (to ensure GC doesn't reclaim it) until you call the close method. On calling close, GC is then free to reclaim all values you had stored, regardless of how long it takes ThreadLocal's implementation to actually release its references.

There are a couple places in Lucene where I left the current usage of ThreadLocal.

First, Analyzer.java uses ThreadLocal to hold reusable token streams. There is no "close" called for Analyzer, so unless we are willing to add a finalizer to call CloseableThreadLocal.close() I think we can leave it.

Second, some of the contrib/benchmark tasks use ThreadLocal to store per-thread DateFormat which should use tiny memory.


Chris Lu added a comment - 13/Sep/08 03:46 PM
I can confirm this patch fixed the memory problem, in general. Thanks!

However, if not nitpicking, I did noticed in this patch, during the transition time between the previous index searcher/reader close and the new index searcher/reader open, there is a bump in the memory graph.
___/___
There are 2 RAMDirectory instances in the memory for a short period of time.

In previous Lucene version without Lucene-1195, the memory graph looks more clean, with a V dip.
____ ____
\/

And in one index refresh during the same index run, there were 2 RAMDirectory in the memory for a fairly long time. But not repeatable.

So I suspect the closing is kind of delayed. And sometimes it could be missed.

Attaching the screenshot for the memory bump, and 2 the memory graphs of the begin and after of 2RAMDirecory in the memory(needed 2 because the time is fairly long)


Chris Lu added a comment - 13/Sep/08 03:48 PM
A little bump when closing and opening the index. The test did the closing first, then opening.

Chris Lu added a comment - 13/Sep/08 03:50 PM
Beginning of a memory increase of 2 RAMDirectory in the memory. This is also during a index closing, then opening.

Chris Lu added a comment - 13/Sep/08 03:51 PM
Endding of a memory increase of 2 RAMDirectory in the memory. This is also during a index closing, then opening. No particular reason, but the RAMDirectory dropped back to 1.

Chris Lu added a comment - 13/Sep/08 04:05 PM
The test is repeatedly close and open the index.

After the repeat close/open test run finished, I closed the RAMDirectory based searcher/reader, and switched to FSDirectory, the RAMDirectory(which is fairly large) was left in the memory. I did several rounds of GC, but the RAMDirectory was not cleaned. At this time, I tried to do a memory dump, but an OOM occured.


Michael McCandless added a comment - 13/Sep/08 06:23 PM
Hmm, the sometimes long-lasting bump in Used memory is odd.

When you close the old IndexSearcher & RAMDirectory do you forcefully dereference them (set all variables that point to them to null) before then reopening the new ones? (Seems like you must, since you saw the V shape before LUCENE-1195, but I just want to verify). If you simply re-assign the variable from old to new then two objects will exist at once until the opening of the new one finishes.

I think the bump may just be GC taking its time collecting the objects, which should be harmless. Maybe using a WeakReference to a long-lived object causes GC to take longer to collect?

Is it possible to use the profiler to look for a reference to the old RAMDirectory while the bump is "up"? That would be a smoking gun that we do still have a reference, unless it's only via the WeakReference.

Or, could you try lowering the heap size in your JRE to something slightly less than 2X one RAMDirectory (but not so low that the "other stuff" you have in the JRE can't fit). This would force GC to work harder / more immediately in reclaiming the unreachable objects.


Chris Lu added a comment - 14/Sep/08 01:26 AM - edited
On second thought, I think this is normal behavior. Like you said, the GC may take a while before really cleaning up stuff. So if I did not wait a while between the close and open, the little bump should be normal.

So I would say this bug is fixed. Thanks!


Chris Lu added a comment - 14/Sep/08 02:14 AM
this is the V shape memory graph using the patch. This also confirms the leak is fixed.

Michael McCandless added a comment - 14/Sep/08 10:14 AM
OK super! Thanks for testing Chris. I'll commit shortly.

Michael McCandless added a comment - 14/Sep/08 10:33 AM
Thanks Chris!

Michael McCandless added a comment - 14/Sep/08 10:34 AM
Committed revision 695184.

Adrian Tarau added a comment - 01/Oct/08 03:32 PM
Even if the issue is closed, for those who want to know why ThreadLocal had to be fixed : http://www.javaspecialists.eu/archive/Issue164.html

"Best Practices

If you must use ThreadLocal, make sure that you remove the value as soon as you are done with it and preferably before you return your thread to a thread pool. Best practice is to use remove() rather than set(null), since that would cause the WeakReference to be removed immediately, together with the value.

Kind regards

Heinz"


robert engels added a comment - 01/Oct/08 04:03 PM
It doesn't need to be "fixed".

It works fine, you need to understand how to use it properly which may require calling remove() if your threads are long-lived.

Also, remove() was only added in 1.5, and that is why that technique is not valid for Lucene. It is also not valid, since you need to clear all values across all threads, remove() only clears the entry for the calling thread.

The current patch solves the problem suitably for Lucene, at the expense of some performance degradation.


Michael McCandless added a comment - 01/Oct/08 04:21 PM

The current patch solves the problem suitably for Lucene, at the expense of some performance degradation.

The performance degradation should be negligible in the case of long-lived threads. No synchronized code was added in the get() path.

It is also not valid, since you need to clear all values across all threads, remove() only clears the entry for the calling thread.

This best practice does work correctly: you're supposed to call remove() from the very thread that had inserted something into the ThreadLocal.

Besides the 1.5 issue, this is also difficult for Lucene because we don't have a clean point to "know" when the thread has finished interacting with Lucene. A thread comes out of the pool, runs a search, gathers docIDs from in a collector, returns from the search, one by one looks up stored docs / term vectors for these docIDs, maybe does secondary search up front to build up filters, etc., finishes rendering the result and returns to the pool. Unless we create a new method "detachThread(...)" somewhere in Lucene, there is no natural point now to do the remove(). And I don't like creating such a method because nobody will ever know they need to call it in their App server until they start hitting cryptic OOMs.


Adrian Tarau added a comment - 01/Oct/08 04:30 PM
Ok, maybe "fixed" was too much and it sounded like ThreadLocal has a problem. ThreadLocal works fine except that it is tricky with long-lived threads.

Anyway, it is a good article in case somebody wants to understand why this change in Lucene.


Adrian Tarau added a comment - 01/Oct/08 04:41 PM
Michael,

Maybe some sort of cleanup method should be created even if regular users will not call it.

I presume it can be highlighted somewhere in the documentation, like in "Best practices" : http://wiki.apache.org/lucene-java/BestPractices

I would really like to be able to control this, to be able to clean thread(s) storage programmatically.


robert engels added a comment - 01/Oct/08 04:56 PM
You cannot control this 'externally", since there is no way to force the cleaning of the stale entries (no API method).

The current patch sort of allows this - the entries are not cleared but the weak references in them are - via the close(), but if this method is called at the wrong time, everything will break (since the stream needs to be cached), thus it is not exposed to the outside world.

If you want to cleanup programatically, close the index readers and writers.


Adrian Tarau added a comment - 01/Oct/08 05:05 PM
If closing all index readers and writers releases all Lucene thread locals it's great.

Thanks.