Issue Details (XML | Word | Printable)

Key: LUCENE-1195
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael Busch
Reporter: Michael Busch
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Performance improvement for TermInfosReader

Created: 26/Feb/08 10:59 PM   Updated: 11/Oct/08 12:49 PM
Return to search
Component/s: Index
Affects Version/s: None
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works lucene-1195.patch 2008-05-23 02:18 AM Michael Busch 19 kB
Text File Licensed for inclusion in ASF works lucene-1195.patch 2008-05-21 08:37 AM Michael Busch 15 kB
Text File Licensed for inclusion in ASF works lucene-1195.patch 2008-02-27 07:31 PM Michael Busch 4 kB
Java Source File Licensed for inclusion in ASF works SafeThreadLocal.java 2008-09-11 03:12 AM robert engels 1 kB

Lucene Fields: New
Resolution Date: 23/May/08 05:22 PM


 Description  « Hide
Currently we have a bottleneck for multi-term queries: the dictionary lookup is being done
twice for each term. The first time in Similarity.idf(), where searcher.docFreq() is called.
The second time when the posting list is opened (TermDocs or TermPositions).

The dictionary lookup is not cheap, that's why a significant performance improvement is
possible here if we avoid the second lookup. An easy way to do this is to add a small LRU
cache to TermInfosReader.

I ran some performance experiments with an LRU cache size of 20, and an mid-size index of
500,000 documents from wikipedia. Here are some test results:

50,000 AND queries with 3 terms each:
old: 152 secs
new (with LRU cache): 112 secs (26% faster)

50,000 OR queries with 3 terms each:
old: 175 secs
new (with LRU cache): 133 secs (24% faster)

For bigger indexes this patch will probably have less impact, for smaller once more.

I will attach a patch soon.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Yonik Seeley added a comment - 27/Feb/08 12:55 AM
Nice results!
I assume this is on a non-optimized index?
One of the unexpected things I've seen in Solr is that enumerating over all the terms of a non-optimized index is a very significant component of the total time (including iterating over termdocs when necessary). This was with terms with a relatively low df though.

Michael Busch added a comment - 27/Feb/08 01:03 AM - edited
Test details:
The index has 500,000 docs and 3191625 unique terms. To construct the queries
I used terms with 1000<df<3000, the index has 3880 of them. I combined the
terms randomly. Each query has at least one hit. The AND queries have 25 hits
on average, the OR queries 5641.

The LRU cache was pretty small with a size of just 20.

The index is unoptimized with 11 segments.

The searcher was warmed for the tests, thus benefiting from FS caching, which
should be a common scenario for indexes of such a medium size.


Yonik Seeley added a comment - 27/Feb/08 01:13 AM
Thinking about a common use in Solr: doing a query and faceting that query by a field... would blow out the cache (due to iterating over all the terms in a single field) if it's a global cache? Is there a good way to prevent that from happening (perhaps just change lucene's existing single -entry thread local cache to a multi-entry thread local cache?)

Eric added a comment - 27/Feb/08 03:28 AM
When faceting(iterating all terms), will each term be looked up twice in dictionary?

Yonik Seeley added a comment - 27/Feb/08 02:31 PM

When faceting(iterating all terms), will each term be looked up twice in dictionary?

No, but consider the case of one thread iterating over all the terms in a field (due to faceting, a range query, prefix query, etc) That would tend to destroy the cache for other threads doing simple queries unless a way could be found to bypass the cache during enumeration somehow, or make each cache thread specific.


Michael Busch added a comment - 27/Feb/08 07:31 PM
Here is the simple patch. The cache is only used in TermInfosReader.get(Term).

So if for example a RangeQuery gets a TermEnum from the IndexReader, then
enumerating the terms using the TermEnum will not replace the terms in the
cache.

The LRUCache itself is not synchronized. It might happen that multiple
threads lookup the same term at the same time, then we might get an cache
miss. But I think such a situation should be very rare, and it's therefore
better to avoid the synchronization overhead?

I set the default cache size to 1024. A cache entry is a (Term, TermInfo)
tuple. TermInfo needs 24 bytes, I think a Term approx. 20-30 bytes? So
the cache would need about 1024 * ~50 bytes = 50Kb plus a bit overhead
from the LinkedHashMap. This is the memory requirement per index segment,
so a non-optimized index with 20 segments would need about 1MB more memory
with this cache. I think this is acceptable? Otherwise we can also decrease
the cache size.

All core & contrib tests pass.


Yonik Seeley added a comment - 27/Feb/08 07:42 PM

The LRUCache itself is not synchronized.

Unfortunately, it needs to be... no getting around it.


Yonik Seeley added a comment - 27/Feb/08 07:46 PM

Here is the simple patch. The cache is only used in TermInfosReader.get(Term).

So if for example a RangeQuery gets a TermEnum from the IndexReader, then
enumerating the terms using the TermEnum will not replace the terms in the
cache.

But for each term, TermDocs.seek() will be called, and that will do a TermInfosReader.get(Term), replacing items in the cache.
For SegmentReader, seek(TermEnum) actually doesn't call TermInfosReader.get(Term), but any kind of multi-reader does.


Michael Busch added a comment - 27/Feb/08 09:03 PM

Unfortunately, it needs to be... no getting around it.

You're right, and I'm stupid
Actually what I meant was that the get() and put() methods don't need to
be synchronized if the underlying data structure, i. e.the LinkedHashMap,
that I'm using is thread-safe, otherwise it might return inconsistent
data.
But the LinkedHashMap is not, unless I decorate it with
Collections.synchronizedMap(). Do you know what's faster? Using the
synchronized map or making get() and put() synchronized? Probably
there's not really a difference, because the decorator that
Collections.synchronizedMap() returns just does essentially the same?


Yonik Seeley added a comment - 27/Feb/08 09:26 PM
There's higher level synchronization too (ensuring that two different threads don't generate the same cache entry at the same time), and I agree that should not be done in this case.

Just use Collections.synchronizedMap(), it will be the same speed, more readable, and can be easily replaced later anyway.


Michael Busch added a comment - 21/May/08 08:37 AM
Changes in the patch:
  • the used cache is thread-safe now
  • added a ThreadLocal to TermInfosReader, so that each thread has its own cache of size 1024 now
  • SegmentTermEnum.scanTo() returns now the number of invocations of next(). TermInfosReader only
    puts TermInfo objects into the cache if scanTo() has called next() more than once. Thus, if e. g.
    a WildcardQuery or RangeQuery iterates over terms in order, only the first term will be put into
    the cache. This is an addition to the ThreadLocal that prevents one thread from wiping out its
    own cache with such a query.
  • added a new package org/apache/lucene/util/cache that has a SimpleMapCache (taken from LUCENE-831)
    and the SimpleLRUCache that was part of the previous patch here. I decided to put the caches in
    a separate package, because we can reuse them for different things like LUCENE-831 or e. g. after
    deprecating Hits as LRU cache for recently loaded stored documents.

I reran the same performance experiments and it turns out that the speedup is still the same and
the overhead of the ThreadLocal is in the noise. So I think this should be a good approach now?

I also ran similar performance tests on a bigger index with about 4.3 million documents. The
speedup with 50k AND queries was, as expected, not as significant anymore. However, the speedup
was still about 7%. I haven't run the OR queries on the bigger index yet, but most likely the
speedup will not be very significant anymore.

All unit tests pass.


Michael Busch added a comment - 23/May/08 02:18 AM
In the previous patch was a silly thread-safety problem that I fixed now.
Some threads in the TestIndexReaderReopen test occasionally hit
errors (I fixed the testcase to fail now whenever an error is hit).

I made some other changes to the TermInfosReader. I'm not using
two ThreadLocals anymore for the SegmentTermEnum and Cache,
but added a small inner class called ThreadResources which holds
references to those two objects. I also minimized the amount of
ThreadLocal.get() calls by passing around the enumerator.

Furthermore I got rid of the private scanEnum() method and inlined
it into the get() method to fix the above mentioned thread-safety
problem. And I also realized that the cache itself does not have to
be thread-safe, because we put it into a ThreadLocal.

I reran the same performance test that I ran for the first patch and
this version seems to be even faster: 107secs vs. 112secs with
the first patch (~30% improvement compared to trunk, 152secs).

All tests pass, including the improved
TestIndexReaderReopen.testThreadSafety(), which I ran multiple
times.

OK I think this patch is ready now, I'm planning to commit it in a
day or so.


Michael Busch added a comment - 23/May/08 05:22 PM
Committed.

Yonik Seeley added a comment - 24/May/08 03:17 PM

SegmentTermEnum.scanTo() returns now the number of invocations of next(). TermInfosReader only
puts TermInfo objects into the cache if scanTo() has called next() more than once. Thus, if e. g.
a WildcardQuery or RangeQuery iterates over terms in order, only the first term will be put into
the cache. This is an addition to the ThreadLocal that prevents one thread from wiping out its
own cache with such a query.

Hmmm, clever, and pretty much free.

It doesn't seem like it would eliminate something like a RangeQuery adding to the cache, but does reduce the amount of pollution. Seems like about 1/64th of the terms would be added to the cache? (every 128th term and the term following that... due to "numScans > 1" check).

Still, it would take a range query covering 64K terms to completely wipe the cache, and as long as that range query is slow relative to the term lookups, I suppose it doesn't matter much if the cache gets wiped anyway. A single additional hash lookup per term probably shouldn't slow the execution of something like a range query that much either.


robert engels added a comment - 11/Sep/08 03:12 AM
A "safe" ThreadLocal that can be used for more deterministic memory usage.

Probably a bit slower than the JDK ThreadLocal, due to the synchronization.

Offers a "purge()" method to force the cleanup of stale entries. Probably most useful in code like this:

SomeLargeObject slo; // maybe a RAMDirectory?
try { slo = new SomeLargeObject(); // or other creation mechanism; } catch (OutOfMemoryException e) { SafeThreadLocal.purge(); // now try again slo = new SomeLargeObject(); // or other creation mechanism; }


robert engels added a comment - 11/Sep/08 05:16 AM
Also, SafeThreadLocal can be trivially changed to reduce the synchronization times, by using a synchronized map - then only the access is sync'd.

Since a ThreadLocal in Lucene is primarily read (after initial creation), a 1.5 lock designed for read often, write rarely would be best.