Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from LUCENE-1458.

      The flex branch is in fairly good shape – all tests pass, initial search performance testing looks good, it survived several visits from the Unicode policeman

      But it still has a number of nocommits, could use some more scrutiny especially on the "emulate old API on flex index" and vice/versa code paths, and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self contained fixes.

      The end is in sight!

      1. LUCENE-2111.patch
        5 kB
        Michael McCandless
      2. Make2BTermsIndex.java
        1 kB
        Michael McCandless
      3. LUCENE-2111.patch
        20 kB
        Michael McCandless
      4. flexBench.py
        5 kB
        Michael McCandless
      5. benchUtil.py
        11 kB
        Michael McCandless
      6. LUCENE-2111.patch
        18 kB
        Michael McCandless
      7. LUCENE-2111.patch
        100 kB
        Michael McCandless
      8. LUCENE-2111.patch
        97 kB
        Michael McCandless
      9. LUCENE-2111.patch
        97 kB
        Michael McCandless
      10. LUCENE-2111.patch
        174 kB
        Michael McCandless
      11. LUCENE-2111.patch
        142 kB
        Michael McCandless
      12. LUCENE-2111.patch
        143 kB
        Michael McCandless
      13. LUCENE-2111.patch
        116 kB
        Michael McCandless
      14. LUCENE-2111_mtqTest.patch
        8 kB
        Robert Muir
      15. LUCENE-2111_mtqNull.patch
        4 kB
        Robert Muir
      16. LUCENE-2111.patch
        95 kB
        Michael McCandless
      17. flex_backwards_merge_912395.patch
        33 kB
        Robert Muir
      18. flex_merge_916543.patch
        1.82 MB
        Robert Muir
      19. LUCENE-2111.patch
        38 kB
        Michael McCandless
      20. LUCENE-2111.patch
        3 kB
        Robert Muir
      21. LUCENE-2111_toString.patch
        5 kB
        Robert Muir
      22. LUCENE-2111.patch
        39 kB
        Michael McCandless
      23. LUCENE-2111_experimental.patch
        29 kB
        Robert Muir
      24. LUCENE-2111_bytesRef.patch
        27 kB
        Robert Muir
      25. LUCENE-2111-EmptyTermsEnum.patch
        8 kB
        Uwe Schindler
      26. LUCENE-2111-EmptyTermsEnum.patch
        5 kB
        Uwe Schindler
      27. LUCENE-2111_fuzzy.patch
        6 kB
        Robert Muir
      28. LUCENE-2111.patch
        148 kB
        Michael McCandless
      29. LUCENE-2111.patch
        407 kB
        Michael McCandless
      30. LUCENE-2111.patch
        54 kB
        Michael McCandless
      31. LUCENE-2111.patch
        150 kB
        Michael McCandless
      32. LUCENE-2111.patch
        77 kB
        Michael McCandless
      33. LUCENE-2111.patch
        12 kB
        Michael McCandless
      34. LUCENE-2111.patch
        24 kB
        Michael McCandless

        Issue Links

          Activity

          Michael McCandless created issue -
          Uwe Schindler made changes -
          Field Original Value New Value
          Affects Version/s Flex Branch [ 12314439 ]
          Uwe Schindler made changes -
          Link This issue is blocked by LUCENE-2109 [ LUCENE-2109 ]
          Uwe Schindler made changes -
          Link This issue is blocked by LUCENE-2110 [ LUCENE-2110 ]
          Hide
          Uwe Schindler added a comment -

          Also the current flex branch produces lots of unchecked warnings... The Generics policeman will visit them and will hopefully help fixing!

          Show
          Uwe Schindler added a comment - Also the current flex branch produces lots of unchecked warnings... The Generics policeman will visit them and will hopefully help fixing!
          Hide
          Michael McCandless added a comment -

          The Generics policeman will visit them and will hopefully help fixing!

          Uh-oh... I sense heavy committing in flex branch's future!

          Show
          Michael McCandless added a comment - The Generics policeman will visit them and will hopefully help fixing! Uh-oh... I sense heavy committing in flex branch's future!
          Hide
          Karthik K added a comment -

          What would the branch name for flex indexing ?

          Show
          Karthik K added a comment - What would the branch name for flex indexing ?
          Hide
          Uwe Schindler added a comment - - edited

          I Mike,

          maybe I found an issue in LegacyFieldsEnum & Co:

          • The LegacyTermsEnum correctly seeks to the first Term/TermRef but as the deprec TermEnum still iterates after the last term, the TermsEnum returns all terms from all later fields, too. So seek() and next() should also do a field==Term.field() comparison and set SeekStatus/return value correctly.
          • Also (you see it because missing @Override): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close?

          Uh-oh... I sense heavy committing in flex branch's future!

          It is not so much, mainly in code added before the final Java 1.5 switch. Should be easy to fix, will look into it in a few days. So the police inspector only has few minor complaints

          Show
          Uwe Schindler added a comment - - edited I Mike, maybe I found an issue in LegacyFieldsEnum & Co: The LegacyTermsEnum correctly seeks to the first Term/TermRef but as the deprec TermEnum still iterates after the last term, the TermsEnum returns all terms from all later fields, too. So seek() and next() should also do a field==Term.field() comparison and set SeekStatus/return value correctly. Also (you see it because missing @Override): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close? Uh-oh... I sense heavy committing in flex branch's future! It is not so much, mainly in code added before the final Java 1.5 switch. Should be easy to fix, will look into it in a few days. So the police inspector only has few minor complaints
          Hide
          Michael McCandless added a comment -

          What would the branch name for flex indexing ?

          It's https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458

          Show
          Michael McCandless added a comment - What would the branch name for flex indexing ? It's https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458
          Hide
          Michael McCandless added a comment -

          The LegacyTermsEnum correctly seeks to the first Term/TermRef but as the deprec TermEnum still iterates after the last term, the TermsEnum returns all terms from all later fields, too. So seek() and next() should also do a field==Term.field() comparison and set SeekStatus/return value correctly.

          Nice catch – I'll open a new issue & fix.

          Also (you see it because missing @Override): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close?

          I actually intentionally left .close() out of all *Enums.

          First, to strongly bias impls from doing such costly things that close
          is necessary. These enums are used in hotspots during searching.

          Second, because for Lucene's core impls, close() is [almost?] always a
          no-op: these impls are very lightweight.

          Third, because in Lucene we don't consistently close the enums we
          pull, today, so we have confusion (there have been posts to java-user
          about this – "do I need to close the TermEnum/TermDocs"). I'd rather
          not add a "close" that for all core impls is a no-op and so Lucene
          doesn't have to call close.

          Fourth, because it complicates our impls if we really must close
          whenever we pull a enum – eg our Scorers pull enums today, but never
          close them.

          Show
          Michael McCandless added a comment - The LegacyTermsEnum correctly seeks to the first Term/TermRef but as the deprec TermEnum still iterates after the last term, the TermsEnum returns all terms from all later fields, too. So seek() and next() should also do a field==Term.field() comparison and set SeekStatus/return value correctly. Nice catch – I'll open a new issue & fix. Also (you see it because missing @Override): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close? I actually intentionally left .close() out of all *Enums. First, to strongly bias impls from doing such costly things that close is necessary. These enums are used in hotspots during searching. Second, because for Lucene's core impls, close() is [almost?] always a no-op: these impls are very lightweight. Third, because in Lucene we don't consistently close the enums we pull, today, so we have confusion (there have been posts to java-user about this – "do I need to close the TermEnum/TermDocs"). I'd rather not add a "close" that for all core impls is a no-op and so Lucene doesn't have to call close. Fourth, because it complicates our impls if we really must close whenever we pull a enum – eg our Scorers pull enums today, but never close them.
          Hide
          Michael McCandless added a comment -

          Attached patch – will commit soon. I found a bug in the "flex API on
          non-flex" layer (the preflex codec) – exposed with a new test case in
          TestBackCompat, and fixed. Also cleaned up some nocommits, added
          indexDivisor to the loadTermsIndex API, and fixed preflex to actually
          implement it.

          Show
          Michael McCandless added a comment - Attached patch – will commit soon. I found a bug in the "flex API on non-flex" layer (the preflex codec) – exposed with a new test case in TestBackCompat, and fixed. Also cleaned up some nocommits, added indexDivisor to the loadTermsIndex API, and fixed preflex to actually implement it.
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12427038 ]
          Robert Muir made changes -
          Link This issue blocks LUCENE-1606 [ LUCENE-1606 ]
          Michael McCandless made changes -
          Link This issue blocks LUCENE-2141 [ LUCENE-2141 ]
          Hide
          Michael McCandless added a comment -

          Attached patch, which adds some nice test coverage of the back compat layers. We still need more tests but this is a good step forward...

          Show
          Michael McCandless added a comment - Attached patch, which adds some nice test coverage of the back compat layers. We still need more tests but this is a good step forward...
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12428544 ]
          Hide
          Michael McCandless added a comment -

          New patch attached – strengthened the back compat testing, which
          uncovered some issues in the back compat layers, that I've now
          fixed.

          I ran the FlexTestUtil.verifyFlexVsPreFlex on a 5M doc pre-flex
          wikipedia index, and a 1M doc flex wikipedia index, with no problems.

          I also ran my NRT stress test, starting from a pre-flex 5M wikipedia,
          and indexing using flex, so this is a good test of mixed pre/post flex
          segments, with no problems.

          Getting closer...

          Show
          Michael McCandless added a comment - New patch attached – strengthened the back compat testing, which uncovered some issues in the back compat layers, that I've now fixed. I ran the FlexTestUtil.verifyFlexVsPreFlex on a 5M doc pre-flex wikipedia index, and a 1M doc flex wikipedia index, with no problems. I also ran my NRT stress test, starting from a pre-flex 5M wikipedia, and indexing using flex, so this is a good test of mixed pre/post flex segments, with no problems. Getting closer...
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12428587 ]
          Hide
          Michael McCandless added a comment -

          Attached patch, changing oal.index.TermRef -> oal.util.BytesRef.

          I think, eventually, we should fix the various places that refer to byte slices, eg Field.get/setBinary*, Payload, UnicodeUtil.UTF8Result, IndexOutput.writeBytes, IndexInput.readBytes, to use BytesRef instead.

          Show
          Michael McCandless added a comment - Attached patch, changing oal.index.TermRef -> oal.util.BytesRef. I think, eventually, we should fix the various places that refer to byte slices, eg Field.get/setBinary*, Payload, UnicodeUtil.UTF8Result, IndexOutput.writeBytes, IndexInput.readBytes, to use BytesRef instead.
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12430547 ]
          Hide
          Michael McCandless added a comment -

          Attached patch w/ various fixes:

          • Switch over payloads to use BytesRef, in flex API
          • DocsEnum.positions now returns null if no positions were indexed
            (ie omitTFAP was set for the field). Also fixed Phrase/SpanQuery
            to throw IllegalStateException when run against an omitTFAP
            field.
          • Rename PositionsConsumer.addPosition -> .add
          Show
          Michael McCandless added a comment - Attached patch w/ various fixes: Switch over payloads to use BytesRef, in flex API DocsEnum.positions now returns null if no positions were indexed (ie omitTFAP was set for the field). Also fixed Phrase/SpanQuery to throw IllegalStateException when run against an omitTFAP field. Rename PositionsConsumer.addPosition -> .add
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12430744 ]
          Hide
          Michael McCandless added a comment -

          Attached patch with a major reworking of some parts of flex:

          • Simplified how the StandardTermsDictReader/Writer interacts with
            the postings impl. The PostingsReader for the codec is now
            stateless, capturing all state for a given term in a dedicated
            TermState class (which also works well w/ caching, since we needed
            to capture state for that anyway).
          • Merged docs & positions readers, in the codec's impl and in the
            exposed flex API. It was just too hairy before, with separate
            classes for reading docs & positions. This is a step back towards
            current trunk API, ie, up front you ask for either a DocsEnum or a
            DocsAndPositionsEnum.
          • Modified API semantics: if a field or term does not exist, then
            IndexReader.termDocs/PositionsEnum may now return null (previously
            they returned a fake empty enum). This means more Weight.scorer()
            may return null.
          • I added IndexReader.getSubReaderDocBase (there is a separate jira
            issue open for this) – this is now more important because a
            filter can no longer guess its doc base by adding up docCount of
            all readers it sees since if the scorer for that segment is null,
            Filter.getDocIdSet will not be called.
          • Changed the reuse of Docs/AndPositionsEnum to be explicit.
            Previously the Terms or TermsEnum instance was holding a private
            reused instance... but that was no good because typically we can
            share the TermsEnum but cannot share postings enums.
          • Likeways, changed the public flex reading API, so that you don't
            separately ask for positions enum at each doc. Instead, up front
            you either ask for a DocsEnum or a DocsAndPositionsEnum. This
            matches how the current Lucene APIs work.
          • Terms dict cache is now at the top level, not per field (this
            matches how trunk works, ie all fields share the 1024 sized cache)

          I cutover all codecs to the new API... all tests pass if you switch
          the default codec (in oal.index.codec.Codecs.getWriter) to any of the
          four.

          Show
          Michael McCandless added a comment - Attached patch with a major reworking of some parts of flex: Simplified how the StandardTermsDictReader/Writer interacts with the postings impl. The PostingsReader for the codec is now stateless, capturing all state for a given term in a dedicated TermState class (which also works well w/ caching, since we needed to capture state for that anyway). Merged docs & positions readers, in the codec's impl and in the exposed flex API. It was just too hairy before, with separate classes for reading docs & positions. This is a step back towards current trunk API, ie, up front you ask for either a DocsEnum or a DocsAndPositionsEnum. Modified API semantics: if a field or term does not exist, then IndexReader.termDocs/PositionsEnum may now return null (previously they returned a fake empty enum). This means more Weight.scorer() may return null. I added IndexReader.getSubReaderDocBase (there is a separate jira issue open for this) – this is now more important because a filter can no longer guess its doc base by adding up docCount of all readers it sees since if the scorer for that segment is null, Filter.getDocIdSet will not be called. Changed the reuse of Docs/AndPositionsEnum to be explicit. Previously the Terms or TermsEnum instance was holding a private reused instance... but that was no good because typically we can share the TermsEnum but cannot share postings enums. Likeways, changed the public flex reading API, so that you don't separately ask for positions enum at each doc. Instead, up front you either ask for a DocsEnum or a DocsAndPositionsEnum. This matches how the current Lucene APIs work. Terms dict cache is now at the top level, not per field (this matches how trunk works, ie all fields share the 1024 sized cache) I cutover all codecs to the new API... all tests pass if you switch the default codec (in oal.index.codec.Codecs.getWriter) to any of the four.
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12431657 ]
          Hide
          Michael McCandless added a comment -

          New flex patch attached:

          • I factored out separate public Multi* (Fields, Terms, etc.) from
            DirectoryReader, DirectoryReader. These classes merge multiple
            flex "sub readers" into a single flex API on the fly.
          • Refactored all places that need to merge sub-readers to use this
            API (DirectoryReader, MultiReader, SegmentMerger). This is
            cleaner because previously SegmentMerger had its own duplicated
            code for doing this merging; now we have a single source for it
            (though merging swaps in its own docs/positions enum, to remap
            docIDs around deletions).
          • Changed the semantics of IndexReader.fields() – for a multi
            reader (any reader that consist of sequential sub readers),
            fields() now throws UOE.

          This is an important change with flex – the caller now bears
          responsibility for create a MultiFields if they really need it.

          My thinking is that primary places in Lucene that consume postings now
          operate per-segment, so a multi reader (Dir/MultiReader) should not
          automatically "join up high" because it entails a hidden performance
          hit. So consumers that must access the flex API at the multi reader
          level should be explicit about it...

          However, to make this simple, I created a sugar static methods on
          MultiFields (eg, MultiFields.getFields(IndexReader)) to easily do
          this, and cutover places in Lucene that may need direct postings from
          a multi-reader to use this method.

          I've updated the javadocs explaining this.

          Show
          Michael McCandless added a comment - New flex patch attached: I factored out separate public Multi* (Fields, Terms, etc.) from DirectoryReader, DirectoryReader. These classes merge multiple flex "sub readers" into a single flex API on the fly. Refactored all places that need to merge sub-readers to use this API (DirectoryReader, MultiReader, SegmentMerger). This is cleaner because previously SegmentMerger had its own duplicated code for doing this merging; now we have a single source for it (though merging swaps in its own docs/positions enum, to remap docIDs around deletions). Changed the semantics of IndexReader.fields() – for a multi reader (any reader that consist of sequential sub readers), fields() now throws UOE. This is an important change with flex – the caller now bears responsibility for create a MultiFields if they really need it. My thinking is that primary places in Lucene that consume postings now operate per-segment, so a multi reader (Dir/MultiReader) should not automatically "join up high" because it entails a hidden performance hit. So consumers that must access the flex API at the multi reader level should be explicit about it... However, to make this simple, I created a sugar static methods on MultiFields (eg, MultiFields.getFields(IndexReader)) to easily do this, and cutover places in Lucene that may need direct postings from a multi-reader to use this method. I've updated the javadocs explaining this.
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12435284 ]
          Hide
          Robert Muir added a comment -

          Mike, here is a patch for removal of fuzzy nocommits:

          • remove synchronization (not necessary, history here: LUCENE-296)
          • reuse char[] rather than create Strings
          • remove unused ctors
          Show
          Robert Muir added a comment - Mike, here is a patch for removal of fuzzy nocommits: remove synchronization (not necessary, history here: LUCENE-296 ) reuse char[] rather than create Strings remove unused ctors
          Robert Muir made changes -
          Attachment LUCENE-2111_fuzzy.patch [ 12435411 ]
          Hide
          Robert Muir added a comment -

          btw, i benched that patch with my contrived benchmark for LUCENE-2089, wierd that flex was slower than trunk before.
          numbers are stable across many iterations.

          unpatched flex patched flex trunk
          4362ms 3239ms 3459ms
          Show
          Robert Muir added a comment - btw, i benched that patch with my contrived benchmark for LUCENE-2089 , wierd that flex was slower than trunk before. numbers are stable across many iterations. unpatched flex patched flex trunk 4362ms 3239ms 3459ms
          Hide
          Uwe Schindler added a comment -

          Mike: I reviewed this EmptyTermsEnum in MTQ. I would leave it in, but simply make EmptyTermsEnum a singleton (which is perfectly fine, because its stateless). Returning null here makes no performance in MTQs, it only makes the code in MTQ#rewrite and MTQWF#getDocIdSet ugly. The biggest problem with returning null here is the backwards layer that must be fixed then (because it checks if getTermsEnum return null and falls back to FilteredTermEnum from trunk). If you really want null, getTermsEnum should per default (if not overriddden) throw UOE and the rewrite code should catch this UOE and only then delegate to backwards layer.

          Show
          Uwe Schindler added a comment - Mike: I reviewed this EmptyTermsEnum in MTQ. I would leave it in, but simply make EmptyTermsEnum a singleton (which is perfectly fine, because its stateless). Returning null here makes no performance in MTQs, it only makes the code in MTQ#rewrite and MTQWF#getDocIdSet ugly. The biggest problem with returning null here is the backwards layer that must be fixed then (because it checks if getTermsEnum return null and falls back to FilteredTermEnum from trunk). If you really want null, getTermsEnum should per default (if not overriddden) throw UOE and the rewrite code should catch this UOE and only then delegate to backwards layer.
          Hide
          Uwe Schindler added a comment -

          Here the EmptyTermsEnum singleton patch (against flex trunk).

          Show
          Uwe Schindler added a comment - Here the EmptyTermsEnum singleton patch (against flex trunk).
          Uwe Schindler made changes -
          Attachment LUCENE-2111-EmptyTermsEnum.patch [ 12435424 ]
          Hide
          Michael McCandless added a comment -

          Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.

          Show
          Michael McCandless added a comment - Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.
          Hide
          Uwe Schindler added a comment -

          Updated patch for empty TermsEnum. It is now a singleton in TermsEnum class itsself.

          Show
          Uwe Schindler added a comment - Updated patch for empty TermsEnum. It is now a singleton in TermsEnum class itsself.
          Uwe Schindler made changes -
          Attachment LUCENE-2111-EmptyTermsEnum.patch [ 12435435 ]
          Hide
          Michael McCandless added a comment -

          New patch looks good Uwe – thanks for re-merging!

          Show
          Michael McCandless added a comment - New patch looks good Uwe – thanks for re-merging!
          Hide
          Michael McCandless added a comment -

          Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.

          Ahh, and also because your patch switches from String to char[], which should improve perf.

          Your patch looks good Robert! Thanks.

          Show
          Michael McCandless added a comment - Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that. Ahh, and also because your patch switches from String to char[], which should improve perf. Your patch looks good Robert! Thanks.
          Hide
          Robert Muir added a comment -

          Ahh, and also because your patch switches from String to char[], which should improve perf.

          actually i didnt apply your LUCENE-2111 when running the benchmark (the improvement is simply the char[]).
          the test is now actually slightly slower now with the rest of LUCENE-2111

          Show
          Robert Muir added a comment - Ahh, and also because your patch switches from String to char[], which should improve perf. actually i didnt apply your LUCENE-2111 when running the benchmark (the improvement is simply the char[]). the test is now actually slightly slower now with the rest of LUCENE-2111
          Hide
          Robert Muir added a comment -

          here is a rough patch, that merges BytesRef/UnicodeUtil

          • added a fn to UnicodeUtil that computes the hash as it encodes, for termshash
          • when doing utf16->utf8 conversion, remove an if for every character by simple allocating utf16*4
          • some method signatures were generalized from String -> CharSequence (i.e. so you could create a BytesRef from a StringBuilder if you want)

          there are some breaks (e.g. binary api compat), but its an internal api.

          Show
          Robert Muir added a comment - here is a rough patch, that merges BytesRef/UnicodeUtil added a fn to UnicodeUtil that computes the hash as it encodes, for termshash when doing utf16->utf8 conversion, remove an if for every character by simple allocating utf16*4 some method signatures were generalized from String -> CharSequence (i.e. so you could create a BytesRef from a StringBuilder if you want) there are some breaks (e.g. binary api compat), but its an internal api.
          Robert Muir made changes -
          Attachment LUCENE-2111_bytesRef.patch [ 12436728 ]
          Hide
          Michael McCandless added a comment -

          Patch looks good Robert! Thanks

          Why not just remove UTF8Result altogether? (ie don't bother deprecating). It's an internal API...

          The new method to compute hash is great, saving the extra pass in THPF.

          Show
          Michael McCandless added a comment - Patch looks good Robert! Thanks Why not just remove UTF8Result altogether? (ie don't bother deprecating). It's an internal API... The new method to compute hash is great, saving the extra pass in THPF.
          Hide
          Robert Muir added a comment -

          ok, i ditched UTF8 result entirely, committed in revision 915511

          Show
          Robert Muir added a comment - ok, i ditched UTF8 result entirely, committed in revision 915511
          Hide
          Robert Muir added a comment -

          attached is a patch that changes various exposed apis to use @lucene.experimental

          i didnt mess with IndexFileNames as there is an open issue about it right now.

          Show
          Robert Muir added a comment - attached is a patch that changes various exposed apis to use @lucene.experimental i didnt mess with IndexFileNames as there is an open issue about it right now.
          Robert Muir made changes -
          Attachment LUCENE-2111_experimental.patch [ 12436846 ]
          Hide
          Robert Muir added a comment -

          these tags are added in revision 915791.

          Show
          Robert Muir added a comment - these tags are added in revision 915791.
          Hide
          Michael McCandless added a comment -

          Attached patch, fixing some more nocommits, and renaming BytesRef.toString -> BytesRef.utf8ToString.

          Show
          Michael McCandless added a comment - Attached patch, fixing some more nocommits, and renaming BytesRef.toString -> BytesRef.utf8ToString.
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12436853 ]
          Hide
          Robert Muir added a comment -

          Here is a few more toString -> utf8ToString.
          will look at the backwards tests now

          Show
          Robert Muir added a comment - Here is a few more toString -> utf8ToString. will look at the backwards tests now
          Robert Muir made changes -
          Attachment LUCENE-2111_toString.patch [ 12436857 ]
          Hide
          Robert Muir added a comment -

          a few more easy nocommits

          Show
          Robert Muir added a comment - a few more easy nocommits
          Robert Muir made changes -
          Attachment LUCENE-2111.patch [ 12437037 ]
          Hide
          Michael McCandless added a comment -

          Attached patch, fixes flex APIs to not return null (instead return .EMPTY objects).

          Show
          Michael McCandless added a comment - Attached patch, fixes flex APIs to not return null (instead return .EMPTY objects).
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12437074 ]
          Hide
          Robert Muir added a comment -

          patch for review of flex merge

          Show
          Robert Muir added a comment - patch for review of flex merge
          Robert Muir made changes -
          Attachment flex_merge_916543.patch [ 12437150 ]
          Hide
          Robert Muir added a comment -

          patch for review, backwards tests merge to flex

          Show
          Robert Muir added a comment - patch for review, backwards tests merge to flex
          Robert Muir made changes -
          Attachment flex_backwards_merge_912395.patch [ 12437151 ]
          Hide
          Michael McCandless added a comment -

          Cuts over to returning .EMPTY instead of null when requesting enums.

          Also disallows Multi/DirReader.getDeletedDocs in favor of static convenience method MultiFields.getDeletedDocs.

          But... we now need a way to determine that a codec does not store positions. I [illegally, adding nocommits] had to add a few places where I check the return result against .EMPTY.

          Show
          Michael McCandless added a comment - Cuts over to returning .EMPTY instead of null when requesting enums. Also disallows Multi/DirReader.getDeletedDocs in favor of static convenience method MultiFields.getDeletedDocs. But... we now need a way to determine that a codec does not store positions. I [illegally, adding nocommits] had to add a few places where I check the return result against .EMPTY.
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12438097 ]
          Hide
          Michael McCandless added a comment -

          But... we now need a way to determine that a codec does not store positions.

          Thinking more about this... I think we should switch back to a null return from .docsAndPositionsEnum if the codec doesn't support positions. We only return .EMPTY if the enum is really just empty.

          Show
          Michael McCandless added a comment - But... we now need a way to determine that a codec does not store positions. Thinking more about this... I think we should switch back to a null return from .docsAndPositionsEnum if the codec doesn't support positions. We only return .EMPTY if the enum is really just empty.
          Hide
          Robert Muir added a comment -

          Patch for MTQ to not use null in getTermsEnum for backwards compat,
          so null can have some other meaning. Instead it uses VirtualMethod,
          with the default implementatinos throwing UOE.

          Show
          Robert Muir added a comment - Patch for MTQ to not use null in getTermsEnum for backwards compat, so null can have some other meaning. Instead it uses VirtualMethod, with the default implementatinos throwing UOE.
          Robert Muir made changes -
          Attachment LUCENE-2111_mtqNull.patch [ 12438102 ]
          Hide
          Robert Muir added a comment -

          Uwe asked for a test for the MTQ back compat... attached is one.
          I think it looks kinda dumb but if its useful, I'll commit it.

          Show
          Robert Muir added a comment - Uwe asked for a test for the MTQ back compat... attached is one. I think it looks kinda dumb but if its useful, I'll commit it.
          Robert Muir made changes -
          Attachment LUCENE-2111_mtqTest.patch [ 12438136 ]
          Hide
          Michael McCandless added a comment -

          Changed .iterator() to never return null, MTQ.getTermsEnum() to never return null, but IR.fields() and Fields.terms(String field), and .docs/.docsAndPositions can return null.

          Also whittled down more nocommits – down to 53 now!

          Show
          Michael McCandless added a comment - Changed .iterator() to never return null, MTQ.getTermsEnum() to never return null, but IR.fields() and Fields.terms(String field), and .docs/.docsAndPositions can return null. Also whittled down more nocommits – down to 53 now!
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12438140 ]
          Hide
          Michael McCandless added a comment -

          Back compat test for MTQ looks good Robert... thanks!

          Show
          Michael McCandless added a comment - Back compat test for MTQ looks good Robert... thanks!
          Hide
          Michael McCandless added a comment -

          Down to 15 nocommits!

          Show
          Michael McCandless added a comment - Down to 15 nocommits!
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12438153 ]
          Hide
          Michael McCandless added a comment -

          Same patch, just fixes the Java 1.6 only changes (adding @Override to interface).

          Show
          Michael McCandless added a comment - Same patch, just fixes the Java 1.6 only changes (adding @Override to interface).
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12438157 ]
          Hide
          Michael McCandless added a comment -

          New rev, just a few changes:

          • Rename BytesRef.toBytesString -> toString (thanks Robert!)
          • No more BytesRef.Comparator – just use java.util.Comparator<BytesRef> (thanks Uwe!)
          • Use Set<String> not Collection<String> when asking Codec for its files/extensions (thanks Mark!)

          I'll commit (on flex branch) sometime today. Making me nervous carrying such a large patch...

          Show
          Michael McCandless added a comment - New rev, just a few changes: Rename BytesRef.toBytesString -> toString (thanks Robert!) No more BytesRef.Comparator – just use java.util.Comparator<BytesRef> (thanks Uwe!) Use Set<String> not Collection<String> when asking Codec for its files/extensions (thanks Mark!) I'll commit (on flex branch) sometime today. Making me nervous carrying such a large patch...
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12438178 ]
          Hide
          Michael McCandless added a comment -

          More nocommit reductions and other fixes:

          • Rename Codecs -> CodecProvider
          • Try to optimize flex API on pre-flex index (don't clone the
            SegmentTermEnum during seek)
          • Use SegmentReadState class when getting fieldsProducer (matches
            SegmentWriteState)
          • Cuts over to a new bulk-read API on DocsEnum that's designed to
            let int block codecs provide direct access to the int[] they have
            (saves extra copy).
          • Down to 9 nocommits!!
          Show
          Michael McCandless added a comment - More nocommit reductions and other fixes: Rename Codecs -> CodecProvider Try to optimize flex API on pre-flex index (don't clone the SegmentTermEnum during seek) Use SegmentReadState class when getting fieldsProducer (matches SegmentWriteState) Cuts over to a new bulk-read API on DocsEnum that's designed to let int block codecs provide direct access to the int[] they have (saves extra copy). Down to 9 nocommits!!
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12438448 ]
          Hide
          Michael McCandless added a comment -

          Forgot to add SegmentReadState.java

          Show
          Michael McCandless added a comment - Forgot to add SegmentReadState.java
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12438450 ]
          Hide
          Michael McCandless added a comment -

          Also adds reuse to pre-flex API when getting a new TermsEnum.

          Show
          Michael McCandless added a comment - Also adds reuse to pre-flex API when getting a new TermsEnum.
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12438454 ]
          Hide
          Michael McCandless added a comment -

          Thanks Shai!

          Show
          Michael McCandless added a comment - Thanks Shai!
          Michael McCandless made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Michael McCandless added a comment -

          Duh – wrong issue! I only wish....

          Show
          Michael McCandless added a comment - Duh – wrong issue! I only wish....
          Michael McCandless made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Hide
          Michael McCandless added a comment -

          Attached patch, eliminating all remaining nocommits on flex branch! I turned most of them into TODOs

          Show
          Michael McCandless added a comment - Attached patch, eliminating all remaining nocommits on flex branch! I turned most of them into TODOs
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12439644 ]
          Hide
          Michael McCandless added a comment -

          I'm benchmarking flex vs trunk, but uncovered a strange performance loss with WildcardQuery. I'm attaching the python wrapper around contrib/benchmark that I'm using. Hopefully this is something silly...

          You have to edit flexBench.py, specificaly the TRUNK_DIR and FLEX_DIR must point to the .../contrib/benchmark of each source area, and you have to edit the WIKI_LINE_FILE and/or WIKI_FILE (I think WIKI_LINE_FILE can be None in which case it should (but I haven't tested recently!) fallback to parsing the .xml.bz2 wikipedia export).

          I'll first build an index of the first 5M wikipedia docs, once for flex and once for trunk, and then run the test queries. It also tests the "flex API on trunk index" case, to test perf of the flex emulation layer... this layer is looking a bit slowish now but I'm not sure how much we can do to speed it up...

          Run like this:

          python -u flexBench.py -run test
          

          I have it set to only test only the wildcard query uni*t right now... and I'm getting this result:

          JAVA:
          java version "1.6.0_17"
          Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
          Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)
          
          
          OS:
          Linux centos 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
          
          Index /x/lucene/flex.work.wiki.nd5M already exists...
          Index /x/lucene/trunk.work.wiki.nd5M already exists...
          Index /x/lucene/flex.work.random.nd5M already exists...
          Index /x/lucene/trunk.work.random.nd5M already exists...
          
          RUN: source=wiki query=un*t sort=None
            run trunk...
              cd /root/src/clean/lucene/contrib/benchmark
              log: /root/src/clean/lucene/contrib/benchmark/logs/trunk.0
              62.49 QPS
            run flex on trunk index...
              cd /root/src/flex.clean/contrib/benchmark
              log: /root/src/flex.clean/contrib/benchmark/logs/flexOnTrunk.1
              25.87 QPS [-58.6% worse]
            run flex on flex index...
              cd /root/src/flex.clean/contrib/benchmark
              log: /root/src/flex.clean/contrib/benchmark/logs/flexOnFlex.2
              39.30 QPS [-37.1% worse]
            124623 hits
          

          Other queries I've tested look OK so far...

          Show
          Michael McCandless added a comment - I'm benchmarking flex vs trunk, but uncovered a strange performance loss with WildcardQuery. I'm attaching the python wrapper around contrib/benchmark that I'm using. Hopefully this is something silly... You have to edit flexBench.py, specificaly the TRUNK_DIR and FLEX_DIR must point to the .../contrib/benchmark of each source area, and you have to edit the WIKI_LINE_FILE and/or WIKI_FILE (I think WIKI_LINE_FILE can be None in which case it should (but I haven't tested recently!) fallback to parsing the .xml.bz2 wikipedia export). I'll first build an index of the first 5M wikipedia docs, once for flex and once for trunk, and then run the test queries. It also tests the "flex API on trunk index" case, to test perf of the flex emulation layer... this layer is looking a bit slowish now but I'm not sure how much we can do to speed it up... Run like this: python -u flexBench.py -run test I have it set to only test only the wildcard query uni*t right now... and I'm getting this result: JAVA: java version "1.6.0_17" Java(TM) SE Runtime Environment (build 1.6.0_17-b04) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode) OS: Linux centos 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux Index /x/lucene/flex.work.wiki.nd5M already exists... Index /x/lucene/trunk.work.wiki.nd5M already exists... Index /x/lucene/flex.work.random.nd5M already exists... Index /x/lucene/trunk.work.random.nd5M already exists... RUN: source=wiki query=un*t sort=None run trunk... cd /root/src/clean/lucene/contrib/benchmark log: /root/src/clean/lucene/contrib/benchmark/logs/trunk.0 62.49 QPS run flex on trunk index... cd /root/src/flex.clean/contrib/benchmark log: /root/src/flex.clean/contrib/benchmark/logs/flexOnTrunk.1 25.87 QPS [-58.6% worse] run flex on flex index... cd /root/src/flex.clean/contrib/benchmark log: /root/src/flex.clean/contrib/benchmark/logs/flexOnFlex.2 39.30 QPS [-37.1% worse] 124623 hits Other queries I've tested look OK so far...
          Michael McCandless made changes -
          Attachment benchUtil.py [ 12439953 ]
          Attachment flexBench.py [ 12439954 ]
          Hide
          Michael McCandless added a comment -

          Towards wrapping up flex, I ran a set of tests to benchmark flex's
          search performance vs trunk.

          All tests are on a 5M doc Wikipedia index, best qps of 5 runs where
          each run runs the query for 5.0 seconds. Env is:

          JAVA:
          java version "1.6.0_17"
          Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
          Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)
          
          OS:
          Linux centos 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
          

          First table compares trunk against "flex on flex", ie, a flex index
          (fully reindexed after upgrading to flex):

          Query Tot hits Sort QPS trunk QPS new Pct change
          1 591225   68.36 80.64 18.0%
              title 64.12 68.53 6.9%
          1 OR 2 953081   19.35 20.80 7.5%
              title 16.50 17.48 5.9%
          1 OR 2 OR 3 1131679   14.37 15.50 7.9%
              title 12.42 13.26 6.8%
          1 OR 2 OR 3 OR 4 1266805   10.94 12.76 16.6%
              title 10.36 11.05 6.7%
          1 AND 2 239303   21.19 22.32 5.3%
              title 22.77 24.25 6.5%
          1 AND 2 AND 3 109513   18.83 19.17 1.8%
              title 19.30 20.06 3.9%
          1 AND 2 AND 3 AND 4 60795   16.21 17.51 8.0%
              title 16.75 18.29 9.2%
          "united states" 528845   7.54 8.54 13.3%
              title 7.36 8.14 10.6%
          "united states of america" 12144   20.64 21.48 4.1%
              title 20.45 21.06 3.0%
          un* 2250238   9.31 11.54 24.0%
              title 8.42 10.96 30.2%
          *ent 2482701   0.32 0.92 187.5%
              title 0.32 0.91 184.4%
          u*t 169192   18.53 47.97 158.9%
              title 17.26 40.10 132.3%
          uni* 1308332   18.54 23.49 26.7%
              title 16.28 20.02 23.0%
          un*t 124623   62.13 105.23 69.4%
              title 50.38 74.99 48.8%
          ?t 554722   0.51 29.31 5647.1%
              title 0.51 26.25 5047.1%
          ??t 1605437   0.60 6.69 1015.0%
              title 0.60 6.22 936.7%
          ???t 3100067   0.54 1.92 255.6%
              title 0.53 1.89 256.6%
          ????t 2973045   0.51 0.71 39.2%
              title 0.51 0.70 37.3%
          ?????t 2323871   0.51 0.39 -23.5%
              title 0.50 0.39 -22.0%
          ??????t 2459025   0.49 0.31 -36.7%
              title 0.48 0.15 -68.7%
          un?t 86664   92.45 241.46 161.2%
              title 72.59 151.28 108.4%
          un??t 2860   222.11 408.52 83.9%
              title 220.91 405.84 83.7%
          un???t 5828   117.38 99.64 -15.1%
              title 111.47 98.64 -11.5%
          un????t 1426   207.03 100.60 -51.4%
              title 207.23 101.36 -51.1%
          united~0.5 872873   0.35 0.31 -11.4%
              title 0.35 0.31 -11.4%
          united~0.6 764041   0.46 5.22 1034.8%
              title 0.45 5.00 1011.1%
          united~0.7 695756   0.59 21.19 3491.5%
              title 0.60 19.10 3083.3%
          united~0.8 693134   0.59 21.44 3533.9%
              title 0.58 19.55 3270.7%
          united~0.9 692299   57.06 67.80 18.8%
              title 55.28 57.87 4.7%

          I also ran the same queries through, but this time using the trunk
          (pre-flex) index with flex, ie to perf test the "flex on pre-flex"
          emulation layer. This is the initial experience users will see if
          they upgrade to flex but don't reindex:

          Query Tot hits Sort QPS trunk QPS new Pct change
          1 591225   68.36 66.91 -2.1%
              title 64.12 58.47 -8.8%
          1 OR 2 953081   19.35 19.06 -1.5%
              title 16.50 16.03 -2.8%
          1 OR 2 OR 3 1131679   14.37 14.14 -1.6%
              title 12.42 12.11 -2.5%
          1 OR 2 OR 3 OR 4 1266805   10.94 11.61 6.1%
              title 10.36 10.04 -3.1%
          1 AND 2 239303   21.19 21.12 -0.3%
              title 22.77 22.46 -1.4%
          1 AND 2 AND 3 109513   18.83 18.81 -0.1%
              title 19.30 19.29 -0.1%
          1 AND 2 AND 3 AND 4 60795   16.21 17.18 6.0%
              title 16.75 17.46 4.2%
          "united states" 528845   7.54 7.63 1.2%
              title 7.36 7.12 -3.3%
          "united states of america" 12144   20.64 19.33 -6.3%
              title 20.45 19.50 -4.6%
          un* 2250238   9.31 9.79 5.2%
              title 8.42 9.65 14.6%
          *ent 2482701   0.32 0.45 40.6%
              title 0.32 0.45 40.6%
          u*t 169192   18.53 24.75 33.6%
              title 17.26 21.96 27.2%
          uni* 1308332   18.54 19.39 4.6%
              title 16.28 15.86 -2.6%
          un*t 124623   62.13 59.73 -3.9%
              title 50.38 48.51 -3.7%
          ?t 554722   0.51 23.65 4537.3%
              title 0.51 21.42 4100.0%
          ??t 1605437   0.60 5.13 755.0%
              title 0.60 4.61 668.3%
          ???t 3100067   0.54 1.28 137.0%
              title 0.53 1.24 134.0%
          ????t 2973045   0.51 0.55 7.8%
              title 0.51 0.54 5.9%
          ?????t 2323871   0.51 0.29 -43.1%
              title 0.50 0.29 -42.0%
          ??????t 2459025   0.49 0.18 -63.3%
              title 0.48 0.21 -56.2%
          un?t 86664   92.45 202.48 119.0%
              title 72.59 134.55 85.4%
          un??t 2860   222.11 187.05 -15.8%
              title 220.91 186.81 -15.4%
          un???t 5828   117.38 69.30 -41.0%
              title 111.47 68.59 -38.5%
          un????t 1426   207.03 60.98 -70.5%
              title 207.23 60.62 -70.7%
          united~0.5 872873   0.35 0.23 -34.3%
              title 0.35 0.23 -34.3%
          united~0.6 764041   0.46 3.84 734.8%
              title 0.45 3.76 735.6%
          united~0.7 695756   0.59 17.45 2857.6%
              title 0.60 15.53 2488.3%
          united~0.8 693134   0.59 17.56 2876.3%
              title 0.58 15.97 2653.4%
          united~0.9 692299   57.06 56.02 -1.8%
              title 55.28 49.26 -10.9%

          There are alot of numbers to absorb... but here's my take:

          • Flex is generally faster.
          • Fuzzy queries and certain wildcard queries (using AutomatonQuery)
            are insanely faster.
          • There are certain specific wildcard corner cases where we are
            slower, but these are likely rarely used in practice (many ?'s
            followed by a suffix).
          • Flex API on a trunk index does take a perf hit but it looks contained enough
            that we don't need to spend any time optimizing that emulation layer...

          I also ran an indexing test (index first 10M docs of wikipedia) and
          flex and trunk had similar times.

          I think net/net we are good to land flex!

          Show
          Michael McCandless added a comment - Towards wrapping up flex, I ran a set of tests to benchmark flex's search performance vs trunk. All tests are on a 5M doc Wikipedia index, best qps of 5 runs where each run runs the query for 5.0 seconds. Env is: JAVA: java version "1.6.0_17" Java(TM) SE Runtime Environment (build 1.6.0_17-b04) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode) OS: Linux centos 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux First table compares trunk against "flex on flex", ie, a flex index (fully reindexed after upgrading to flex): Query Tot hits Sort QPS trunk QPS new Pct change 1 591225   68.36 80.64 18.0%     title 64.12 68.53 6.9% 1 OR 2 953081   19.35 20.80 7.5%     title 16.50 17.48 5.9% 1 OR 2 OR 3 1131679   14.37 15.50 7.9%     title 12.42 13.26 6.8% 1 OR 2 OR 3 OR 4 1266805   10.94 12.76 16.6%     title 10.36 11.05 6.7% 1 AND 2 239303   21.19 22.32 5.3%     title 22.77 24.25 6.5% 1 AND 2 AND 3 109513   18.83 19.17 1.8%     title 19.30 20.06 3.9% 1 AND 2 AND 3 AND 4 60795   16.21 17.51 8.0%     title 16.75 18.29 9.2% "united states" 528845   7.54 8.54 13.3%     title 7.36 8.14 10.6% "united states of america" 12144   20.64 21.48 4.1%     title 20.45 21.06 3.0% un* 2250238   9.31 11.54 24.0%     title 8.42 10.96 30.2% *ent 2482701   0.32 0.92 187.5%     title 0.32 0.91 184.4% u*t 169192   18.53 47.97 158.9%     title 17.26 40.10 132.3% uni* 1308332   18.54 23.49 26.7%     title 16.28 20.02 23.0% un*t 124623   62.13 105.23 69.4%     title 50.38 74.99 48.8% ?t 554722   0.51 29.31 5647.1%     title 0.51 26.25 5047.1% ??t 1605437   0.60 6.69 1015.0%     title 0.60 6.22 936.7% ???t 3100067   0.54 1.92 255.6%     title 0.53 1.89 256.6% ????t 2973045   0.51 0.71 39.2%     title 0.51 0.70 37.3% ?????t 2323871   0.51 0.39 -23.5%     title 0.50 0.39 -22.0% ??????t 2459025   0.49 0.31 -36.7%     title 0.48 0.15 -68.7% un?t 86664   92.45 241.46 161.2%     title 72.59 151.28 108.4% un??t 2860   222.11 408.52 83.9%     title 220.91 405.84 83.7% un???t 5828   117.38 99.64 -15.1%     title 111.47 98.64 -11.5% un????t 1426   207.03 100.60 -51.4%     title 207.23 101.36 -51.1% united~0.5 872873   0.35 0.31 -11.4%     title 0.35 0.31 -11.4% united~0.6 764041   0.46 5.22 1034.8%     title 0.45 5.00 1011.1% united~0.7 695756   0.59 21.19 3491.5%     title 0.60 19.10 3083.3% united~0.8 693134   0.59 21.44 3533.9%     title 0.58 19.55 3270.7% united~0.9 692299   57.06 67.80 18.8%     title 55.28 57.87 4.7% I also ran the same queries through, but this time using the trunk (pre-flex) index with flex, ie to perf test the "flex on pre-flex" emulation layer. This is the initial experience users will see if they upgrade to flex but don't reindex: Query Tot hits Sort QPS trunk QPS new Pct change 1 591225   68.36 66.91 -2.1%     title 64.12 58.47 -8.8% 1 OR 2 953081   19.35 19.06 -1.5%     title 16.50 16.03 -2.8% 1 OR 2 OR 3 1131679   14.37 14.14 -1.6%     title 12.42 12.11 -2.5% 1 OR 2 OR 3 OR 4 1266805   10.94 11.61 6.1%     title 10.36 10.04 -3.1% 1 AND 2 239303   21.19 21.12 -0.3%     title 22.77 22.46 -1.4% 1 AND 2 AND 3 109513   18.83 18.81 -0.1%     title 19.30 19.29 -0.1% 1 AND 2 AND 3 AND 4 60795   16.21 17.18 6.0%     title 16.75 17.46 4.2% "united states" 528845   7.54 7.63 1.2%     title 7.36 7.12 -3.3% "united states of america" 12144   20.64 19.33 -6.3%     title 20.45 19.50 -4.6% un* 2250238   9.31 9.79 5.2%     title 8.42 9.65 14.6% *ent 2482701   0.32 0.45 40.6%     title 0.32 0.45 40.6% u*t 169192   18.53 24.75 33.6%     title 17.26 21.96 27.2% uni* 1308332   18.54 19.39 4.6%     title 16.28 15.86 -2.6% un*t 124623   62.13 59.73 -3.9%     title 50.38 48.51 -3.7% ?t 554722   0.51 23.65 4537.3%     title 0.51 21.42 4100.0% ??t 1605437   0.60 5.13 755.0%     title 0.60 4.61 668.3% ???t 3100067   0.54 1.28 137.0%     title 0.53 1.24 134.0% ????t 2973045   0.51 0.55 7.8%     title 0.51 0.54 5.9% ?????t 2323871   0.51 0.29 -43.1%     title 0.50 0.29 -42.0% ??????t 2459025   0.49 0.18 -63.3%     title 0.48 0.21 -56.2% un?t 86664   92.45 202.48 119.0%     title 72.59 134.55 85.4% un??t 2860   222.11 187.05 -15.8%     title 220.91 186.81 -15.4% un???t 5828   117.38 69.30 -41.0%     title 111.47 68.59 -38.5% un????t 1426   207.03 60.98 -70.5%     title 207.23 60.62 -70.7% united~0.5 872873   0.35 0.23 -34.3%     title 0.35 0.23 -34.3% united~0.6 764041   0.46 3.84 734.8%     title 0.45 3.76 735.6% united~0.7 695756   0.59 17.45 2857.6%     title 0.60 15.53 2488.3% united~0.8 693134   0.59 17.56 2876.3%     title 0.58 15.97 2653.4% united~0.9 692299   57.06 56.02 -1.8%     title 55.28 49.26 -10.9% There are alot of numbers to absorb... but here's my take: Flex is generally faster. Fuzzy queries and certain wildcard queries (using AutomatonQuery) are insanely faster. There are certain specific wildcard corner cases where we are slower, but these are likely rarely used in practice (many ?'s followed by a suffix). Flex API on a trunk index does take a perf hit but it looks contained enough that we don't need to spend any time optimizing that emulation layer... I also ran an indexing test (index first 10M docs of wikipedia) and flex and trunk had similar times. I think net/net we are good to land flex!
          Hide
          Robert Muir added a comment -

          I think net/net we are good to land flex!

          +1. The tests have been passing for some time now, and Solr tests pass too.

          It would be nice to look at merging flex into the trunk soon so that it gets more exposure.

          Show
          Robert Muir added a comment - I think net/net we are good to land flex! +1. The tests have been passing for some time now, and Solr tests pass too. It would be nice to look at merging flex into the trunk soon so that it gets more exposure.
          Hide
          Michael McCandless added a comment -

          Small fixes for flex – fixes SpanTermQuerty to throw exc if it's run on a field that omitTFAPs (matches PhraseQuery), fixes all jdoc warnings, spells out back compat breaks in changes.

          Show
          Michael McCandless added a comment - Small fixes for flex – fixes SpanTermQuerty to throw exc if it's run on a field that omitTFAPs (matches PhraseQuery), fixes all jdoc warnings, spells out back compat breaks in changes.
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12440237 ]
          Hide
          Michael Busch added a comment -

          Flex is generally faster.

          Awesome work! What changes make those queries run faster with the default codec? Mostly terms dict changes and automaton for fuzzy/wildcard?

          How's the indexing performance?

          I think net/net we are good to land flex!

          +1! Even if there are still small things to change/fix I think it makes sense to merge with trunk now.

          Show
          Michael Busch added a comment - Flex is generally faster. Awesome work! What changes make those queries run faster with the default codec? Mostly terms dict changes and automaton for fuzzy/wildcard? How's the indexing performance? I think net/net we are good to land flex! +1! Even if there are still small things to change/fix I think it makes sense to merge with trunk now.
          Hide
          Robert Muir added a comment -

          There are certain specific wildcard corner cases where we are
          slower, but these are likely rarely used in practice (many ?'s
          followed by a suffix).

          I think it would be good to fix this in the future, but I certainly think its a rare case.
          The problem is similar to where an SQL engine decides to just table-scan instead
          of using a btree index... In this case we are trying to be too smart and just seek
          to the correct term based on the query instead of scanning, but this causes too
          many seeks.

          At the same time, you have to be careful or you make the wrong decision
          and give O(n) performance instead of O(log n).

          In my opinion it would be better to think in the future how we can improve lucene
          in the following ways:

          • The term dictionary should be more "DFA-friendly", e.g. the whole concept of TermsEnum is wrong,
            linear enumeration of terms is inefficient for any big index. we should get away from it.
          • Instead it would be nice to think of the index like an FST, and instead of enumerating things and filtering them,
            we provide a DFA and enumerate the transduced results.
          • We need to eliminate the UTF-8/UTF-16 impedence mismatch which causes so much
            complication and unnecessary hairy code today.

          All this being said, I think flex is a great move forward for multitermqueries, at least
          we have a seeking-friendly API! One step at a time.

          Show
          Robert Muir added a comment - There are certain specific wildcard corner cases where we are slower, but these are likely rarely used in practice (many ?'s followed by a suffix). I think it would be good to fix this in the future, but I certainly think its a rare case. The problem is similar to where an SQL engine decides to just table-scan instead of using a btree index... In this case we are trying to be too smart and just seek to the correct term based on the query instead of scanning, but this causes too many seeks. At the same time, you have to be careful or you make the wrong decision and give O(n) performance instead of O(log n). In my opinion it would be better to think in the future how we can improve lucene in the following ways: The term dictionary should be more "DFA-friendly", e.g. the whole concept of TermsEnum is wrong, linear enumeration of terms is inefficient for any big index. we should get away from it. Instead it would be nice to think of the index like an FST, and instead of enumerating things and filtering them, we provide a DFA and enumerate the transduced results. We need to eliminate the UTF-8/UTF-16 impedence mismatch which causes so much complication and unnecessary hairy code today. All this being said, I think flex is a great move forward for multitermqueries, at least we have a seeking-friendly API! One step at a time.
          Hide
          Michael McCandless added a comment -

          Awesome work! What changes make those queries run faster with the default codec? Mostly terms dict changes and automaton for fuzzy/wildcard?

          The AutomatonQuery (for fuzzy/wildcard) gives the biggest gains Other MTQs (prefix) see gains I think because of more efficient terms enum. The TermQuery speedup surprises me – that can't be a terms dict thing (just one lookup); i'm not sure offhand why it's faster. That code is not very different than trunk.

          How's the indexing performance?

          Unchanged – I indexed first 10M docs of wikipedia and the times were nearly identical.

          Show
          Michael McCandless added a comment - Awesome work! What changes make those queries run faster with the default codec? Mostly terms dict changes and automaton for fuzzy/wildcard? The AutomatonQuery (for fuzzy/wildcard) gives the biggest gains Other MTQs (prefix) see gains I think because of more efficient terms enum. The TermQuery speedup surprises me – that can't be a terms dict thing (just one lookup); i'm not sure offhand why it's faster. That code is not very different than trunk. How's the indexing performance? Unchanged – I indexed first 10M docs of wikipedia and the times were nearly identical.
          Hide
          Michael McCandless added a comment -

          The term dictionary should be more "DFA-friendly", e.g. the whole concept of TermsEnum is wrong,
          linear enumeration of terms is inefficient for any big index. we should get away from it.
          Instead it would be nice to think of the index like an FST, and instead of enumerating things and filtering them,
          we provide a DFA and enumerate the transduced results.
          We need to eliminate the UTF-8/UTF-16 impedence mismatch which causes so much
          complication and unnecessary hairy code today.

          +1 – we already see these limitations now in making AutomatonQuery consume the straight enum. If we flipped the problem around (you pass a DFA to the codec and it does the intersection & enums the result), and we used byte-based DFAs, I think we'd get a good speedup.

          Show
          Michael McCandless added a comment - The term dictionary should be more "DFA-friendly", e.g. the whole concept of TermsEnum is wrong, linear enumeration of terms is inefficient for any big index. we should get away from it. Instead it would be nice to think of the index like an FST, and instead of enumerating things and filtering them, we provide a DFA and enumerate the transduced results. We need to eliminate the UTF-8/UTF-16 impedence mismatch which causes so much complication and unnecessary hairy code today. +1 – we already see these limitations now in making AutomatonQuery consume the straight enum. If we flipped the problem around (you pass a DFA to the codec and it does the intersection & enums the result), and we used byte-based DFAs, I think we'd get a good speedup.
          Hide
          Michael McCandless added a comment -

          Flex has some trouble making > 2B terms – attached patch creates such an index. I'm still getting to the bottom of it...

          Show
          Michael McCandless added a comment - Flex has some trouble making > 2B terms – attached patch creates such an index. I'm still getting to the bottom of it...
          Michael McCandless made changes -
          Attachment Make2BTermsIndex.java [ 12440414 ]
          Hide
          Michael McCandless added a comment -

          Patch fixes standard codec's terms dict to handle > 2B terms; I also strengthened CheckIndex to verify that .ord() of the TermsEnum always returns the right result, for codecs that implement .ord. I'll commit shortly...

          Show
          Michael McCandless added a comment - Patch fixes standard codec's terms dict to handle > 2B terms; I also strengthened CheckIndex to verify that .ord() of the TermsEnum always returns the right result, for codecs that implement .ord. I'll commit shortly...
          Michael McCandless made changes -
          Attachment LUCENE-2111.patch [ 12440488 ]
          Michael McCandless made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Fix Version/s 4.0.0 [ 12314822 ]
          Fix Version/s 3.1 [ 12314025 ]
          Uwe Schindler made changes -
          Fix Version/s 4.0 [ 12314025 ]
          Fix Version/s 3.1 [ 12314822 ]
          Uwe Schindler made changes -
          Affects Version/s 4.0 [ 12314025 ]
          Affects Version/s Flex Branch [ 12314439 ]
          Mark Thomas made changes -
          Workflow jira [ 12483761 ] Default workflow, editable Closed status [ 12564675 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564675 ] jira [ 12584333 ]
          Gavin made changes -
          Link This issue blocks LUCENE-1606 [ LUCENE-1606 ]
          Gavin made changes -
          Link This issue is depended upon by LUCENE-1606 [ LUCENE-1606 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development