LUCENE-1195: Performance improvement for TermInfosReader

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently we have a bottleneck for multi-term queries: the dictionary lookup is being done
      twice for each term. The first time in Similarity.idf(), where searcher.docFreq() is called.
      The second time when the posting list is opened (TermDocs or TermPositions).

      The dictionary lookup is not cheap, so a significant performance improvement is possible
      here if we avoid the second lookup. An easy way to do this is to add a small LRU cache to
      TermInfosReader.
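
      (For illustration only, a minimal sketch of the double lookup described above, written
      against the Searcher/IndexReader API; the method, field, and term names are made up:)

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.index.Term;
          import org.apache.lucene.index.TermDocs;
          import org.apache.lucene.search.Searcher;

          class DoubleLookupExample {
              // Hypothetical example; "body"/"lucene" are made-up values.
              void doubleLookup(Searcher searcher, IndexReader reader) throws IOException {
                  Term term = new Term("body", "lucene");
                  int df = searcher.docFreq(term);           // 1st dictionary lookup (feeds Similarity.idf())
                  TermDocs termDocs = reader.termDocs(term); // 2nd dictionary lookup (opens the posting list)
                  try {
                      while (termDocs.next()) {
                          // iterate the matching documents
                      }
                  } finally {
                      termDocs.close();
                  }
              }
          }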

      I ran some performance experiments with an LRU cache size of 20 and a mid-sized index of
      500,000 documents from Wikipedia. Here are some test results:

      50,000 AND queries with 3 terms each:
      old: 152 secs
      new (with LRU cache): 112 secs (26% faster)

      50,000 OR queries with 3 terms each:
      old: 175 secs
      new (with LRU cache): 133 secs (24% faster)

      For bigger indexes this patch will probably have less impact; for smaller ones, more.

      I will attach a patch soon.

      Attachments

      1. lucene-1195.patch
        19 kB
        Michael Busch
      2. lucene-1195.patch
        15 kB
        Michael Busch
      3. lucene-1195.patch
        4 kB
        Michael Busch
      4. SafeThreadLocal.java
        1 kB
        robert engels

        Activity

        yseeley@gmail.com Yonik Seeley added a comment -

        Nice results!
        I assume this is on a non-optimized index?
        One of the unexpected things I've seen in Solr is that enumerating over all the terms of a non-optimized index is a very significant component of the total time (including iterating over termdocs when necessary). This was with terms with a relatively low df though.

        michaelbusch Michael Busch added a comment - edited

        Test details:
        The index has 500,000 docs and 3,191,625 unique terms. To construct the queries
        I used terms with 1000 < df < 3000; the index has 3,880 of them. I combined the
        terms randomly. Each query has at least one hit. The AND queries have 25 hits
        on average, the OR queries 5,641.

        The LRU cache was pretty small with a size of just 20.

        The index is unoptimized with 11 segments.

        The searcher was warmed for the tests, thus benefiting from FS caching, which
        should be a common scenario for indexes of such a medium size.

        yseeley@gmail.com Yonik Seeley added a comment -

        Thinking about a common use in Solr: doing a query and faceting that query by a field... would that blow out the cache (due to iterating over all the terms in a single field) if it's a global cache? Is there a good way to prevent that from happening (perhaps just change Lucene's existing single-entry thread-local cache to a multi-entry thread-local cache)?

        blueye118 Eric added a comment -

        When faceting (iterating all terms), will each term be looked up twice in the dictionary?

        yseeley@gmail.com Yonik Seeley added a comment -

        When faceting (iterating all terms), will each term be looked up twice in the dictionary?

        No, but consider the case of one thread iterating over all the terms in a field (due to faceting, a range query, a prefix query, etc.). That would tend to destroy the cache for other threads doing simple queries unless a way could be found to bypass the cache during enumeration somehow, or to make each cache thread-specific.

        michaelbusch Michael Busch added a comment -

        Here is the simple patch. The cache is only used in TermInfosReader.get(Term).

        So if for example a RangeQuery gets a TermEnum from the IndexReader, then
        enumerating the terms using the TermEnum will not replace the terms in the
        cache.

        The LRUCache itself is not synchronized. It might happen that multiple
        threads look up the same term at the same time, in which case we might get a
        cache miss. But I think such a situation should be very rare, so it's
        better to avoid the synchronization overhead?

        I set the default cache size to 1024. A cache entry is a (Term, TermInfo)
        tuple. TermInfo needs 24 bytes, and a Term approx. 20-30 bytes, I think. So
        the cache would need about 1024 * ~50 bytes ≈ 50 KB plus a bit of overhead
        from the LinkedHashMap. This is the memory requirement per index segment,
        so a non-optimized index with 20 segments would need about 1 MB more memory
        with this cache. I think this is acceptable? Otherwise we can also decrease
        the cache size.

        All core & contrib tests pass.

        yseeley@gmail.com Yonik Seeley added a comment -

        The LRUCache itself is not synchronized.

        Unfortunately, it needs to be... no getting around it.

        yseeley@gmail.com Yonik Seeley added a comment -

        Here is the simple patch. The cache is only used in TermInfosReader.get(Term).

        So if for example a RangeQuery gets a TermEnum from the IndexReader, then
        enumerating the terms using the TermEnum will not replace the terms in the
        cache.

        But for each term, TermDocs.seek() will be called, and that will do a TermInfosReader.get(Term), replacing items in the cache.
        For SegmentReader, seek(TermEnum) actually doesn't call TermInfosReader.get(Term), but any kind of multi-reader does.

        michaelbusch Michael Busch added a comment -

        Unfortunately, it needs to be... no getting around it.

        You're right, and I'm stupid
        Actually what I meant was that the get() and put() methods wouldn't need to
        be synchronized if the underlying data structure, i.e. the LinkedHashMap
        that I'm using, were thread-safe; otherwise it might return inconsistent
        data.
        But the LinkedHashMap is not, unless I decorate it with
        Collections.synchronizedMap(). Do you know what's faster: using the
        synchronized map or making get() and put() synchronized? Probably
        there's not really a difference, because the decorator that
        Collections.synchronizedMap() returns essentially does the same thing?

        yseeley@gmail.com Yonik Seeley added a comment -

        There's higher-level synchronization too (ensuring that two different threads don't generate the same cache entry at the same time), and I agree that should not be done in this case.

        Just use Collections.synchronizedMap(), it will be the same speed, more readable, and can be easily replaced later anyway.
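
        (For example, something along these lines; SimpleLRUCache stands in for whatever
        non-thread-safe LRU map implementation the patch uses, and the key/value types are
        placeholders:)

            import java.util.Collections;
            import java.util.Map;

            class TermInfoCacheHolder {   // hypothetical holder, for illustration only
                // Wrap the non-thread-safe LRU map in a synchronized decorator so that every
                // get()/put() on the cache is guarded by the same monitor.
                final Map<Object, Object> cache =
                        Collections.synchronizedMap(new SimpleLRUCache<Object, Object>(1024));
            }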

        michaelbusch Michael Busch added a comment -

        Changes in the patch:

        • the cache that is used is now thread-safe
        • added a ThreadLocal to TermInfosReader, so that each thread now has its own cache of size 1024
        • SegmentTermEnum.scanTo() now returns the number of invocations of next(). TermInfosReader only
          puts TermInfo objects into the cache if scanTo() has called next() more than once. Thus, if e.g.
          a WildcardQuery or RangeQuery iterates over terms in order, only the first term will be put into
          the cache. This is an addition to the ThreadLocal that prevents one thread from wiping out its
          own cache with such a query. (A sketch of this scheme follows at the end of this comment.)
        • added a new package org/apache/lucene/util/cache that has a SimpleMapCache (taken from LUCENE-831)
          and the SimpleLRUCache that was part of the previous patch here. I decided to put the caches in
          a separate package, because we can reuse them for different things like LUCENE-831 or, e.g., as an
          LRU cache for recently loaded stored documents after deprecating Hits.

        I reran the same performance experiments and it turns out that the speedup is still the same and
        the overhead of the ThreadLocal is in the noise. So I think this should be a good approach now?

        I also ran similar performance tests on a bigger index with about 4.3 million documents. The
        speedup with 50k AND queries was, as expected, not as significant anymore. However, the speedup
        was still about 7%. I haven't run the OR queries on the bigger index yet, but most likely the
        speedup will not be very significant anymore.

        All unit tests pass.
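
        (For illustration only, a rough sketch of the thread-local cache plus the
        "only cache when scanTo() had to scan" rule described in the list above; the types
        and helper methods are placeholders and are not claimed to match the committed code:)

            import java.util.HashMap;
            import java.util.Map;

            class PerThreadTermCache {
                // Stand-ins for Lucene's Term/TermInfo, to keep the sketch self-contained.
                static class TermKey { }
                static class TermData { }

                // Each thread gets its own map, so the map itself needs no synchronization.
                private final ThreadLocal<Map<TermKey, TermData>> cache =
                        new ThreadLocal<Map<TermKey, TermData>>() {
                            protected Map<TermKey, TermData> initialValue() {
                                return new HashMap<TermKey, TermData>(); // an LRU map in the real patch
                            }
                        };

                TermData get(TermKey term) {
                    Map<TermKey, TermData> c = cache.get();
                    TermData ti = c.get(term);
                    if (ti != null) {
                        return ti;                   // cache hit: the dictionary lookup is skipped
                    }
                    int numScans = scanTo(term);     // scan the term dictionary to the term
                    ti = readTermData(term);
                    if (numScans > 1) {
                        // Only cache terms that required a real scan; an in-order enumeration
                        // (e.g. a RangeQuery) mostly needs 0 or 1 scans and so does not flush the cache.
                        c.put(term, ti);
                    }
                    return ti;
                }

                private int scanTo(TermKey term) { return 2; }                          // placeholder
                private TermData readTermData(TermKey term) { return new TermData(); }  // placeholder
            }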

        michaelbusch Michael Busch added a comment -

        The previous patch had a silly thread-safety problem that I have now fixed.
        Some threads in the TestIndexReaderReopen test occasionally hit
        errors (I fixed the test case so that it now fails whenever an error is hit).

        I made some other changes to TermInfosReader. I'm no longer using
        two ThreadLocals for the SegmentTermEnum and the cache,
        but added a small inner class called ThreadResources which holds
        references to those two objects. I also minimized the number of
        ThreadLocal.get() calls by passing around the enumerator.

        Furthermore I got rid of the private scanEnum() method and inlined
        it into the get() method to fix the above-mentioned thread-safety
        problem. I also realized that the cache itself does not have to
        be thread-safe, because we put it into a ThreadLocal.

        I reran the same performance test that I ran for the first patch, and
        this version seems to be even faster: 107 secs vs. 112 secs with
        the first patch (~30% improvement compared to trunk, 152 secs).

        All tests pass, including the improved
        TestIndexReaderReopen.testThreadSafety(), which I ran multiple
        times.

        OK I think this patch is ready now, I'm planning to commit it in a
        day or so.
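
        (For illustration, a minimal sketch of the per-thread resource holder described above;
        ThreadResources mirrors the comment, while the member types are stand-ins for Lucene's
        package-private classes and do not claim to match the committed code:)

            class TermInfosReaderSketch {
                static class SegmentTermEnumStub { }   // stands in for SegmentTermEnum
                static class TermCacheStub { }         // stands in for the per-thread LRU cache

                // One ThreadLocal holding both per-thread objects, instead of two separate ThreadLocals.
                private static final class ThreadResources {
                    SegmentTermEnumStub termEnum;
                    TermCacheStub termInfoCache;       // per-thread, so it needs no synchronization
                }

                private final ThreadLocal<ThreadResources> threadResources =
                        new ThreadLocal<ThreadResources>();

                private ThreadResources getThreadResources() {
                    ThreadResources resources = threadResources.get();
                    if (resources == null) {
                        resources = new ThreadResources();
                        resources.termEnum = new SegmentTermEnumStub();
                        resources.termInfoCache = new TermCacheStub();
                        threadResources.set(resources);
                    }
                    // Callers pass 'resources' (or its enumerator) around, so ThreadLocal.get()
                    // is only called once per top-level operation.
                    return resources;
                }
            }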

        michaelbusch Michael Busch added a comment -

        Committed.

        yseeley@gmail.com Yonik Seeley added a comment -

        SegmentTermEnum.scanTo() returns now the number of invocations of next(). TermInfosReader only
        puts TermInfo objects into the cache if scanTo() has called next() more than once. Thus, if e. g.
        a WildcardQuery or RangeQuery iterates over terms in order, only the first term will be put into
        the cache. This is an addition to the ThreadLocal that prevents one thread from wiping out its
        own cache with such a query.

        Hmmm, clever, and pretty much free.

        It doesn't seem like it would eliminate something like a RangeQuery adding to the cache, but it does reduce the amount of pollution. Seems like about 1/64th of the terms would be added to the cache? (Every 128th term and the term following it... due to the "numScans > 1" check.)

        Still, it would take a range query covering 64K terms to completely wipe the cache, and as long as that range query is slow relative to the term lookups, I suppose it doesn't matter much if the cache gets wiped anyway. A single additional hash lookup per term probably shouldn't slow the execution of something like a range query that much either.

        rengels@ix.netcom.com robert engels added a comment -

        A "safe" ThreadLocal that can be used for more deterministic memory usage.

        Probably a bit slower than the JDK ThreadLocal, due to the synchronization.

        Offers a "purge()" method to force the cleanup of stale entries. Probably most useful in code like this:

        SomeLargeObject slo; // maybe a RAMDirectory?
        try {
            slo = new SomeLargeObject(); // or other creation mechanism
        } catch (OutOfMemoryError e) {   // OutOfMemoryError is the actual JDK class
            SafeThreadLocal.purge();
            // now try again
            slo = new SomeLargeObject(); // or other creation mechanism
        }
        rengels@ix.netcom.com robert engels added a comment -

        Also, SafeThreadLocal can be trivially changed to reduce the synchronization time by using a synchronized map - then only the access is sync'd.

        Since a ThreadLocal in Lucene is primarily read (after initial creation), a Java 1.5 lock designed for read-often, write-rarely access would be best.
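
        (For illustration only, the kind of read-often/write-rarely locking suggested above,
        using java.util.concurrent.locks from Java 5; the class and method names are
        hypothetical and are not taken from the attached SafeThreadLocal.java:)

            import java.util.HashMap;
            import java.util.Map;
            import java.util.concurrent.locks.ReentrantReadWriteLock;

            class ReadMostlyThreadValues<T> {
                private final Map<Thread, T> values = new HashMap<Thread, T>();
                private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

                T get() {
                    lock.readLock().lock();          // many readers may hold the read lock concurrently
                    try {
                        return values.get(Thread.currentThread());
                    } finally {
                        lock.readLock().unlock();
                    }
                }

                void set(T value) {
                    lock.writeLock().lock();         // exclusive, but writes are rare after initial creation
                    try {
                        values.put(Thread.currentThread(), value);
                    } finally {
                        lock.writeLock().unlock();
                    }
                }
            }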


          People

          • Assignee:
            michaelbusch Michael Busch
          • Reporter:
            michaelbusch Michael Busch
          • Votes:
            0
          • Watchers:
            2
