Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Optimize MultiTermEnum and MultiTermDocs to avoid seeks on TermDocs that don't match the term.

        Activity

        Hide
        Yonik Seeley added a comment - - edited

        Attaching optimization patch. Results up front:
        random seeks to common terms with term enumerator: 58% improvement
        full iteration over all docs matching relatively unique terms: 1595% improvement

        The optimizations:

        • MultiTermEnum keeps track of which segments match... if termDocs.seek(termEnum) is used, then MultiTermDocs will only visit segments that matched the term.
        • MultiTermEnum defers calling next() on sub enumerators until needed. This allows MultiTermDocs to use the faster seek(enum) since the enumerator is still on the correct term. This also avoids unnecessary calls to next() that may never be used, as well as unnecessary insertions into the priority queue. Using seek(enum) in the sub TermDocs also allows cascading of these optimizations (in the event that one has a MultiReader of MultiReaders).

        Test index: this was obviously stacked to show best-case performance for these optimizations. 999,999 documents with maxBufferedDocs=10, resulting in 46 segments. The full iteration test used relatively unique terms (1 or 2 docs matching each), and the random seeks test used very common terms (if rare terms are used in this test, the initial seek dominates and swamps any improvement from the deferral of calls to next().)

        Show
        Yonik Seeley added a comment - - edited Attaching optimization patch. Results up front: random seeks to common terms with term enumerator: 58% improvement full iteration over all docs matching relatively unique terms: 1595% improvement The optimizations: MultiTermEnum keeps track of which segments match... if termDocs.seek(termEnum) is used, then MultiTermDocs will only visit segments that matched the term. MultiTermEnum defers calling next() on sub enumerators until needed. This allows MultiTermDocs to use the faster seek(enum) since the enumerator is still on the correct term. This also avoids unnecessary calls to next() that may never be used, as well as unnecessary insertions into the priority queue. Using seek(enum) in the sub TermDocs also allows cascading of these optimizations (in the event that one has a MultiReader of MultiReaders). Test index: this was obviously stacked to show best-case performance for these optimizations. 999,999 documents with maxBufferedDocs=10, resulting in 46 segments. The full iteration test used relatively unique terms (1 or 2 docs matching each), and the random seeks test used very common terms (if rare terms are used in this test, the initial seek dominates and swamps any improvement from the deferral of calls to next().)
        Hide
        Paul Elschot added a comment -

        Do I interpret correctly that getting the docIds for terms that are (almost) primary keys on a non optimized index will become a lot faster with this patch?

        Show
        Paul Elschot added a comment - Do I interpret correctly that getting the docIds for terms that are (almost) primary keys on a non optimized index will become a lot faster with this patch?
        Hide
        Yonik Seeley added a comment -

        Yes, if you are doing low level stuff directly on the MultiReader, like using TermEnum/TermDocs. Or calling Filter.getBits(multiReader). As you probably know, if you pass Filters or Queries to an IndexSearcher, it's now dropping down to the segment level already (a scorer is created per-segment) so it won't hit the MultiTermEnum code.

        Show
        Yonik Seeley added a comment - Yes, if you are doing low level stuff directly on the MultiReader, like using TermEnum/TermDocs. Or calling Filter.getBits(multiReader). As you probably know, if you pass Filters or Queries to an IndexSearcher, it's now dropping down to the segment level already (a scorer is created per-segment) so it won't hit the MultiTermEnum code.
        Hide
        Michael McCandless added a comment -

        Patch look good Yonik!

        Show
        Michael McCandless added a comment - Patch look good Yonik!
        Hide
        Yonik Seeley added a comment -

        Hmmm, unfortunately, this patch does nothing for RangeFilter.getDocIdSet(MultiReader), as the term enumerator used there is no longer a MultiTermEnum.

        Show
        Yonik Seeley added a comment - Hmmm, unfortunately, this patch does nothing for RangeFilter.getDocIdSet(MultiReader), as the term enumerator used there is no longer a MultiTermEnum.
        Hide
        Uwe Schindler added a comment -

        Really? It's just a FilteredTermEnum on top of the TermEnum returned by MultiReader - so even when the filtered enum filters some terms, it should use the MultiTermEnum methods to iterate the terms.

        Show
        Uwe Schindler added a comment - Really? It's just a FilteredTermEnum on top of the TermEnum returned by MultiReader - so even when the filtered enum filters some terms, it should use the MultiTermEnum methods to iterate the terms.
        Hide
        Yonik Seeley added a comment -

        It's just a FilteredTermEnum on top of the TermEnum returned by MultiReader

        Yes, but the MultiTermEnum has the internal knowledge about what segments matched the term, and actually has the sub-TermEnums positioned on that term.

        Show
        Yonik Seeley added a comment - It's just a FilteredTermEnum on top of the TermEnum returned by MultiReader Yes, but the MultiTermEnum has the internal knowledge about what segments matched the term, and actually has the sub-TermEnums positioned on that term.
        Hide
        Uwe Schindler added a comment -

        Yes I understand, the problematic call is TermDocs.seek(TermEnum) which does an instanceof check. If it is a FilteredTermEnum or anything other, which is wrapped, the optimization is not used.

        Show
        Uwe Schindler added a comment - Yes I understand, the problematic call is TermDocs.seek(TermEnum) which does an instanceof check. If it is a FilteredTermEnum or anything other, which is wrapped, the optimization is not used.
        Hide
        Yonik Seeley added a comment -

        Getting back to this... although this unfortunately won't currently help classes like RangeFilter used directly on a MultiReader because a MultiTermEnum is no longer used, I still think this is worth committing as-is for users of MultiTermEnum.

        Thoughts/objections?

        Show
        Yonik Seeley added a comment - Getting back to this... although this unfortunately won't currently help classes like RangeFilter used directly on a MultiReader because a MultiTermEnum is no longer used, I still think this is worth committing as-is for users of MultiTermEnum. Thoughts/objections?
        Hide
        Michael McCandless added a comment -

        although this unfortunately won't currently help classes like RangeFilter used directly on a MultiReader because a MultiTermEnum is no longer used, I still think this is worth committing as-is for users of MultiTermEnum.

        +1

        It'd be great to somehow have the optimization apply to RangeFilter as well, though since Lucene has now moved to per-segment searching, it's low priority.

        Show
        Michael McCandless added a comment - although this unfortunately won't currently help classes like RangeFilter used directly on a MultiReader because a MultiTermEnum is no longer used, I still think this is worth committing as-is for users of MultiTermEnum. +1 It'd be great to somehow have the optimization apply to RangeFilter as well, though since Lucene has now moved to per-segment searching, it's low priority.
        Hide
        Yonik Seeley added a comment -

        Committed.

        Show
        Yonik Seeley added a comment - Committed.
        Hide
        Michael McCandless added a comment -

        I'm seeing this new AIOOBE when tracking down the intermittent failure
        in TestStressIndexing2:

        1) testRandomIWReader(org.apache.lucene.index.TestStressIndexing2)
        java.lang.ArrayIndexOutOfBoundsException: 6
        	at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.next(MultiSegmentReader.java:672)
        	at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:292)
        	at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:250)
        	at org.apache.lucene.index.TestStressIndexing2.testRandomIWReader(TestStressIndexing2.java:67)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:585)
        	at junit.framework.TestCase.runTest(TestCase.java:168)
        	at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
        	at junit.framework.TestCase.runBare(TestCase.java:134)
        	at junit.framework.TestResult$1.protect(TestResult.java:110)
        	at junit.framework.TestResult.runProtected(TestResult.java:128)
        	at junit.framework.TestResult.run(TestResult.java:113)
        	at junit.framework.TestCase.run(TestCase.java:124)
        	at junit.framework.TestSuite.runTest(TestSuite.java:232)
        	at junit.framework.TestSuite.run(TestSuite.java:227)
        	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
        	at org.junit.internal.runners.CompositeRunner.runChildren(CompositeRunner.java:33)
        	at org.junit.internal.runners.CompositeRunner.run(CompositeRunner.java:28)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:109)
        	at org.junit.runner.JUnitCore.run(JUnitCore.java:100)
        	at org.junit.runner.JUnitCore.runMain(JUnitCore.java:81)
        	at org.junit.runner.JUnitCore.main(JUnitCore.java:44)
        

        I think it's because this optimization isn't admissible in the case
        when one calls MultiTermDocs.seek on a MultiTermEnum derived from a
        different MultiSegmentReader. Ie, I think there needs to be another
        check that verifies in fact the MultiTermEnum passed to
        MultiTermDocs.seek share the same MultiSegmentReader?

        Before this optimiztion, this was OK (only the term was used from the
        MultiTermEnum).

        Show
        Michael McCandless added a comment - I'm seeing this new AIOOBE when tracking down the intermittent failure in TestStressIndexing2: 1) testRandomIWReader(org.apache.lucene.index.TestStressIndexing2) java.lang.ArrayIndexOutOfBoundsException: 6 at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.next(MultiSegmentReader.java:672) at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:292) at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:250) at org.apache.lucene.index.TestStressIndexing2.testRandomIWReader(TestStressIndexing2.java:67) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at junit.framework.TestCase.runTest(TestCase.java:168) at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88) at junit.framework.TestCase.runBare(TestCase.java:134) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:124) at junit.framework.TestSuite.runTest(TestSuite.java:232) at junit.framework.TestSuite.run(TestSuite.java:227) at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81) at org.junit.internal.runners.CompositeRunner.runChildren(CompositeRunner.java:33) at org.junit.internal.runners.CompositeRunner.run(CompositeRunner.java:28) at org.junit.runner.JUnitCore.run(JUnitCore.java:130) at org.junit.runner.JUnitCore.run(JUnitCore.java:109) at org.junit.runner.JUnitCore.run(JUnitCore.java:100) at org.junit.runner.JUnitCore.runMain(JUnitCore.java:81) at org.junit.runner.JUnitCore.main(JUnitCore.java:44) I think it's because this optimization isn't admissible in the case when one calls MultiTermDocs.seek on a MultiTermEnum derived from a different MultiSegmentReader. Ie, I think there needs to be another check that verifies in fact the MultiTermEnum passed to MultiTermDocs.seek share the same MultiSegmentReader? Before this optimiztion, this was OK (only the term was used from the MultiTermEnum).
        Hide
        Yonik Seeley added a comment -

        Gah... I forgot it was permissible (or at least not disallowed) to pass an Enum not derived from the same reader.
        I'll fix.

        Show
        Yonik Seeley added a comment - Gah... I forgot it was permissible (or at least not disallowed) to pass an Enum not derived from the same reader. I'll fix.
        Hide
        Yonik Seeley added a comment -

        I just committed the fix (since trunk was broken) and a test that failed w/o the fix, but if anyone has a better idea how to handle/fix, we can certainly still discuss. I just did the obvious - store the multi-reader in the TermEnum and TermDocs instances and compare in TermDocs.seek(TermEnum)

        Show
        Yonik Seeley added a comment - I just committed the fix (since trunk was broken) and a test that failed w/o the fix, but if anyone has a better idea how to handle/fix, we can certainly still discuss. I just did the obvious - store the multi-reader in the TermEnum and TermDocs instances and compare in TermDocs.seek(TermEnum)

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Yonik Seeley
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development