Lucene - Core
  1. Lucene - Core
  2. LUCENE-2694

MTQ rewrite + weight/scorer init should be single pass

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff of LUCENE-2690 (see the hacked patch on that issue)...

      Once we fix MTQ rewrite to be per-segment, we should take it further and make weight/scorer init also run in the same single pass as rewrite.

      1. LUCENE-2694_hack.patch
        92 kB
        Robert Muir
      2. LUCENE-2694.patch
        85 kB
        Simon Willnauer
      3. LUCENE-2694.patch
        84 kB
        Simon Willnauer
      4. LUCENE-2694.patch
        83 kB
        Simon Willnauer
      5. LUCENE-2694.patch
        84 kB
        Simon Willnauer
      6. LUCENE-2694.patch
        94 kB
        Simon Willnauer
      7. LUCENE-2694.patch
        95 kB
        Simon Willnauer
      8. LUCENE-2694.patch
        92 kB
        Simon Willnauer
      9. LUCENE-2694.patch
        92 kB
        Simon Willnauer
      10. LUCENE-2694.patch
        82 kB
        Simon Willnauer
      11. LUCENE-2694.patch
        78 kB
        Simon Willnauer
      12. LUCENE-2694.patch
        73 kB
        Simon Willnauer
      13. LUCENE-2694.patch
        39 kB
        Simon Willnauer
      14. LUCENE-2694-FTE.patch
        1 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          I started on this issue with a rough idea and wanna upload it to get some initial feedback. The idea is to provide access to TermState via the TermsEnum Attribute API to eventually use the TermState inside of TermQuery to prevent a second lookup as well as term cache. Its very rough and I tried to go the least intrusive way as possible so the implementation is mainly to show the main principles.

          feedback welcome.

          Show
          Simon Willnauer added a comment - I started on this issue with a rough idea and wanna upload it to get some initial feedback. The idea is to provide access to TermState via the TermsEnum Attribute API to eventually use the TermState inside of TermQuery to prevent a second lookup as well as term cache. Its very rough and I tried to go the least intrusive way as possible so the implementation is mainly to show the main principles. feedback welcome.
          Hide
          Uwe Schindler added a comment - - edited

          Havent looked closely into the patch (still need to understand the whole thing), just some comments from attribute policeman in general:

          • The TermStateAttributeImpl.copyTo should throw ClassCastEx if attributes are not conform (compare other impls), so the if statement should not be there. AttributeSource takes care of copying. This is not used, but for completeness.
          • the convenience addClause() method in abstract base class should be final! Else you could incorrectly override the wrong one. We already have code duplication in your patch because of this. When you make it final you will see!
          • why is the attribute using a SetOnce? Attributes generally should be modifiable multiple times. Now you have to call clear() first. This may change in future when we have set-once attributes, but for now that violates the contract
          • Is the docFreq no already part of the state so TermCollectingRewrite does not need to expose it separately?
          • What happens in the term collectors when the same term with different states are merged in the PQ/TermsHash/...?
          Show
          Uwe Schindler added a comment - - edited Havent looked closely into the patch (still need to understand the whole thing), just some comments from attribute policeman in general: The TermStateAttributeImpl.copyTo should throw ClassCastEx if attributes are not conform (compare other impls), so the if statement should not be there. AttributeSource takes care of copying. This is not used, but for completeness. the convenience addClause() method in abstract base class should be final! Else you could incorrectly override the wrong one. We already have code duplication in your patch because of this. When you make it final you will see! why is the attribute using a SetOnce? Attributes generally should be modifiable multiple times. Now you have to call clear() first. This may change in future when we have set-once attributes, but for now that violates the contract Is the docFreq no already part of the state so TermCollectingRewrite does not need to expose it separately? What happens in the term collectors when the same term with different states are merged in the PQ/TermsHash/...?
          Hide
          Michael McCandless added a comment -

          I would rather not use an attribute here – this is a very core thing so I think we should extend TermsEnum API.

          We can just add a getTermState and a seek(TermState) to terms enum (and, actually, remove cacheTermState)?

          Also, then we wouldn't need to add the get docs/positions enum methods to TermState.

          Show
          Michael McCandless added a comment - I would rather not use an attribute here – this is a very core thing so I think we should extend TermsEnum API. We can just add a getTermState and a seek(TermState) to terms enum (and, actually, remove cacheTermState)? Also, then we wouldn't need to add the get docs/positions enum methods to TermState.
          Hide
          Simon Willnauer added a comment -

          Uwe I agree with your comments - I just didn't pay lots of attention to it since I was rather interested in feedback for the idea....

          I actually think that we should move the getTermState into the termsEnum though rather than using an attribute but for now that was easier to implement though.

          What happens in the term collectors when the same term with different states are merged in the PQ/TermsHash/...?

          That one should work since I map the TermState per reader in a PerReaderTermState though.

          simon

          Show
          Simon Willnauer added a comment - Uwe I agree with your comments - I just didn't pay lots of attention to it since I was rather interested in feedback for the idea.... I actually think that we should move the getTermState into the termsEnum though rather than using an attribute but for now that was easier to implement though. What happens in the term collectors when the same term with different states are merged in the PQ/TermsHash/...? That one should work since I map the TermState per reader in a PerReaderTermState though. simon
          Hide
          Simon Willnauer added a comment -

          We can just add a getTermState and a seek(TermState) to terms enum (and, actually, remove cacheTermState)?

          Yes! please! - i used the attribute to move faster here since it didn't require to change the API really.

          Also, then we wouldn't need to add the get docs/positions enum methods to TermState.

          yeah see above....

          i will workout a cleaner patch

          Show
          Simon Willnauer added a comment - We can just add a getTermState and a seek(TermState) to terms enum (and, actually, remove cacheTermState)? Yes! please! - i used the attribute to move faster here since it didn't require to change the API really. Also, then we wouldn't need to add the get docs/positions enum methods to TermState. yeah see above.... i will workout a cleaner patch
          Hide
          Simon Willnauer added a comment -

          next iteration. This patch removes the term cache completely and exposes getTermState via TermsEnum. Terms, TermsEnum and IndexReader can now obtain a DocEnum directly by passing in a TermState.

          I need to run some benchmarks on an index with several segments on a index with two segments its just slightly faster...
          mike do you have one ready?

          Show
          Simon Willnauer added a comment - next iteration. This patch removes the term cache completely and exposes getTermState via TermsEnum. Terms, TermsEnum and IndexReader can now obtain a DocEnum directly by passing in a TermState. I need to run some benchmarks on an index with several segments on a index with two segments its just slightly faster... mike do you have one ready?
          Hide
          Michael McCandless added a comment -

          Phew that was fast!

          Wow, you nuked the terms dict cache Nice!

          Though it makes me a bit nervous... like there'll always be a risk
          we've missed some path through Lucene that does two lookups... And,
          even for external reasons (eg same query arrives to Lucene, looking
          for next page or something), the cache is useful.

          EG, a straight TermQuery (not spawned by MTQ) is now hitting the terms
          dict twice. Once inside Sim.idfExplain, where it calls
          searcher.docFreq(term), and then again to pull the scorers per sub
          reader. Probably, TermQuery should pull the PerReaderTermState, up
          front, if it wasn't already handed it? And then pass the docFreq to
          Sim.idfExplain.

          Should we add a PerReaderTermState.docFreq(), which just sums up
          across all subs?

          Does TermState really need field()? Seems wasteful to have to store
          that... eg an MTQ will store many TermStates against the same field.
          I think we should keep TermState lean.

          Also, I think it shouldn't need that clone method?

          I think instead of duplicating docs/docsAndPositions (and soon
          bulkPostings) on TermsEnum, once for TermState and once without, we
          should just add a seek(TermState)? And then the single
          docs/docsAndPositions/etc. method can be used to get the enum for that
          term. (Likewise for Terms) Also, we should remove docFreq and ord
          from TermsEnum since you should get it from TermState?

          I think IndexReader can offer the sugar methods (that take either
          BytesRef term or String field + TermState state).

          Also: I tried to run the benchmark on beast but unfortunately there's
          a bug somewhere (even though Lucene core tests pass) – I see
          different results for some fuzzy queries.

          Nice work!! Getting to single term lookup for all queries will be awesome!

          Show
          Michael McCandless added a comment - Phew that was fast! Wow, you nuked the terms dict cache Nice! Though it makes me a bit nervous... like there'll always be a risk we've missed some path through Lucene that does two lookups... And, even for external reasons (eg same query arrives to Lucene, looking for next page or something), the cache is useful. EG, a straight TermQuery (not spawned by MTQ) is now hitting the terms dict twice. Once inside Sim.idfExplain, where it calls searcher.docFreq(term), and then again to pull the scorers per sub reader. Probably, TermQuery should pull the PerReaderTermState, up front, if it wasn't already handed it? And then pass the docFreq to Sim.idfExplain. Should we add a PerReaderTermState.docFreq(), which just sums up across all subs? Does TermState really need field()? Seems wasteful to have to store that... eg an MTQ will store many TermStates against the same field. I think we should keep TermState lean. Also, I think it shouldn't need that clone method? I think instead of duplicating docs/docsAndPositions (and soon bulkPostings) on TermsEnum, once for TermState and once without, we should just add a seek(TermState)? And then the single docs/docsAndPositions/etc. method can be used to get the enum for that term. (Likewise for Terms) Also, we should remove docFreq and ord from TermsEnum since you should get it from TermState? I think IndexReader can offer the sugar methods (that take either BytesRef term or String field + TermState state). Also: I tried to run the benchmark on beast but unfortunately there's a bug somewhere (even though Lucene core tests pass) – I see different results for some fuzzy queries. Nice work!! Getting to single term lookup for all queries will be awesome!
          Hide
          Michael McCandless added a comment -

          BTW, one use case where this patch should show a sizable performance gain is a "primary key lookup" against a multi-segment index.

          So this'd be a TermQuery against eg an "id" field, where the app knows at most one doc contains the requested value.

          Today, we pay a high price for the 2nd pass, because we do not cache a miss against a segment. So on the first pass (computing IDF) we know which segment has a match and which segments do not, but then on the 2nd pass we re-pay the lookup cost against all the misses (the single segment w/ the hit will be cached).

          So this ought to be a big win... especially once we combine this w/ the speedups from pulsing codec (we still need to cutover to this as a default) then primary key lookups in a Lucene index will be much faster...

          Show
          Michael McCandless added a comment - BTW, one use case where this patch should show a sizable performance gain is a "primary key lookup" against a multi-segment index. So this'd be a TermQuery against eg an "id" field, where the app knows at most one doc contains the requested value. Today, we pay a high price for the 2nd pass, because we do not cache a miss against a segment. So on the first pass (computing IDF) we know which segment has a match and which segments do not, but then on the 2nd pass we re-pay the lookup cost against all the misses (the single segment w/ the hit will be cached). So this ought to be a big win... especially once we combine this w/ the speedups from pulsing codec (we still need to cutover to this as a default) then primary key lookups in a Lucene index will be much faster...
          Hide
          Simon Willnauer added a comment -

          Attaching current state - all test pass for me and luceneutils brings consistent results with trunk.

          
                         Query   QPS trunkQPS termstate  Pct diff
                      unit~2.0       14.70       14.39     -2.1%
                    united~2.0        6.91        6.83     -1.1%
                    united~1.0        7.42        7.38     -0.6%
                  "unit state"       12.31       12.37      0.5%
                      unit~1.0       15.41       15.49      0.5%
                          uni*        7.18        7.22      0.6%
                          un*d        7.97        8.04      0.9%
                         unit*       12.89       13.09      1.6%
                  +unit +state       28.16       28.64      1.7%
              +nebraska +state       81.26       82.67      1.7%
          spanNear([unit, state], 10, true)       11.60       11.83      2.0%
                         state       40.50       41.47      2.4%
            spanFirst(unit, 5)       47.65       48.84      2.5%
                    unit state       17.72       18.19      2.7%
                           u*d        4.27        4.48      5.0%
          

          those are the results I have for now.... Fuzzy only expands to 50 terms so that might no be very meaningful. I re-added the TermCache for this patch though...
          Will attach more info tomorrow.

          Show
          Simon Willnauer added a comment - Attaching current state - all test pass for me and luceneutils brings consistent results with trunk. Query QPS trunkQPS termstate Pct diff unit~2.0 14.70 14.39 -2.1% united~2.0 6.91 6.83 -1.1% united~1.0 7.42 7.38 -0.6% "unit state" 12.31 12.37 0.5% unit~1.0 15.41 15.49 0.5% uni* 7.18 7.22 0.6% un*d 7.97 8.04 0.9% unit* 12.89 13.09 1.6% +unit +state 28.16 28.64 1.7% +nebraska +state 81.26 82.67 1.7% spanNear([unit, state], 10, true ) 11.60 11.83 2.0% state 40.50 41.47 2.4% spanFirst(unit, 5) 47.65 48.84 2.5% unit state 17.72 18.19 2.7% u*d 4.27 4.48 5.0% those are the results I have for now.... Fuzzy only expands to 50 terms so that might no be very meaningful. I re-added the TermCache for this patch though... Will attach more info tomorrow.
          Hide
          Robert Muir added a comment -

          We shouldn't lose the clone() optimization in StandardPostingsReader...
          the class is final so it should use 'copy' instead of calling super.clone()
          This is important for -client.

          Show
          Robert Muir added a comment - We shouldn't lose the clone() optimization in StandardPostingsReader... the class is final so it should use 'copy' instead of calling super.clone() This is important for -client.
          Hide
          Simon Willnauer added a comment -

          FYI - there is a coding error in the latest patch that causes the TermState to be ignored - TermWeight uses the wrong reference to the PerReaderTermState. I will upload a new patch later this weekend

          simon

          Show
          Simon Willnauer added a comment - FYI - there is a coding error in the latest patch that causes the TermState to be ignored - TermWeight uses the wrong reference to the PerReaderTermState. I will upload a new patch later this weekend simon
          Hide
          Simon Willnauer added a comment -

          here is a new patch. I removed the hacky TermWeight part to make only MTQ single pass for now. Other TermQueries will hit the TermCache for now. No nocommit left. Currently there is some duplication / unncessary classes in the TermState hierarchy - that needs cleanup.

          BTW. I see some highlighter test failing - will look into this later...
          simon

          Show
          Simon Willnauer added a comment - here is a new patch. I removed the hacky TermWeight part to make only MTQ single pass for now. Other TermQueries will hit the TermCache for now. No nocommit left. Currently there is some duplication / unncessary classes in the TermState hierarchy - that needs cleanup. BTW. I see some highlighter test failing - will look into this later... simon
          Hide
          Michael McCandless added a comment -

          If I force scoring BQ rewrite for wildcard & prefix queries (ie set that rewrite mode and then relax BQ max clause count) I see healthy speedups (~23-27%) for these queries! Great

          While this doesn't happen w/ our default settings (ie these queries quickly cutover to constant filter rewrite), apps that change these defaults will see a gain, plus, the term cache (which today "protects" you) is terribly fragile since apps w/ many MTQ queries in flight can thrash that cache thus killing performance. This patch prevents that entirely since MTQs do their own caching of the TermStates they need: awesome.

          Show
          Michael McCandless added a comment - If I force scoring BQ rewrite for wildcard & prefix queries (ie set that rewrite mode and then relax BQ max clause count) I see healthy speedups (~23-27%) for these queries! Great While this doesn't happen w/ our default settings (ie these queries quickly cutover to constant filter rewrite), apps that change these defaults will see a gain, plus, the term cache (which today "protects" you) is terribly fragile since apps w/ many MTQ queries in flight can thrash that cache thus killing performance. This patch prevents that entirely since MTQs do their own caching of the TermStates they need: awesome.
          Hide
          Michael McCandless added a comment -

          I love seeing cacheCurrentTerm removed!!

          OK I think we are close! A bunch of smallish things:

          • I think we should remove TermsEnum.docFreq and .ord? Ie replace
            with .termState().docFreq() and .ord()?
          • At first I was thinking we should merge up TermStateBase into
            TermState but actually there are cases (eg PulsingCodec, which )
            where you want the separation.
          • Maybe rename TermStateBase -> PrefixCodedTermState? Ie this is
            really the TermState impl used by any codec using
            PrefixCodedTerms? EG the fact that it stores the filePointer into
            a _X.tis file is particular to it...
          • Maybe rename MockTermState -> BasicTermState? At first I was
            thinking the codec should return null if it cannot seek by
            TermState... (I generally don't like mock returns that hide/lose
            information...) but then it's convenient to always have something
            to hold the docFreq for the term to avoid lots of special cased
            code... so I think it's OK?
          • We lost the "clone using new" in StandardTermState...
          • Maybe revert changes to AppendingCodec? (Ie let it pass its terms
            dict cache size again)
          • I wonder if we can somehow make PerReaderTermState use an array
            (keyed by sub reader index) instead... seems like a new HashMap
            per Term in an MTQ could be heavy. It's tricky because we don't
            store enough information (ie to quickly map parent reader + sub
            reader -> sub index). But I don't think this should hold up
            committing... since our defaults don't typically allow for that
            many terms in-flight it should be fine...
          • It's a little spooky the TermQuery.scorer calls .take()
            (destructive), eg it means if you ask for scorer again on same
            reader you get diff't behavior? Can we make that a .get() instead
            of .take()? (This may also bite us if we use diff't threads to
            score each segment, ie suddenly this .take() must be thread safe).
            In fact, same deal w/ nulling out the TQ.perReaderTermState?
          • The comment on top of TermStateByteStart looks wrong?
          • Small whitespace issue – missing space on "if(". Also, our
            generics are not supposed to have whitespace inside, eg we
            shouldn't have the space in "new DoubleBarrelLRUCache<FieldAndTerm, TermStateBase>(termsCacheSize);"
          • I think the TQ ctor that takes both docFreq and states can drop
            the docFreq? Ie it can ask the states for it?
          Show
          Michael McCandless added a comment - I love seeing cacheCurrentTerm removed!! OK I think we are close! A bunch of smallish things: I think we should remove TermsEnum.docFreq and .ord? Ie replace with .termState().docFreq() and .ord()? At first I was thinking we should merge up TermStateBase into TermState but actually there are cases (eg PulsingCodec, which ) where you want the separation. Maybe rename TermStateBase -> PrefixCodedTermState? Ie this is really the TermState impl used by any codec using PrefixCodedTerms? EG the fact that it stores the filePointer into a _X.tis file is particular to it... Maybe rename MockTermState -> BasicTermState? At first I was thinking the codec should return null if it cannot seek by TermState... (I generally don't like mock returns that hide/lose information...) but then it's convenient to always have something to hold the docFreq for the term to avoid lots of special cased code... so I think it's OK? We lost the "clone using new" in StandardTermState... Maybe revert changes to AppendingCodec? (Ie let it pass its terms dict cache size again) I wonder if we can somehow make PerReaderTermState use an array (keyed by sub reader index) instead... seems like a new HashMap per Term in an MTQ could be heavy. It's tricky because we don't store enough information (ie to quickly map parent reader + sub reader -> sub index). But I don't think this should hold up committing... since our defaults don't typically allow for that many terms in-flight it should be fine... It's a little spooky the TermQuery.scorer calls .take() (destructive), eg it means if you ask for scorer again on same reader you get diff't behavior? Can we make that a .get() instead of .take()? (This may also bite us if we use diff't threads to score each segment, ie suddenly this .take() must be thread safe). In fact, same deal w/ nulling out the TQ.perReaderTermState? The comment on top of TermStateByteStart looks wrong? Small whitespace issue – missing space on "if(". Also, our generics are not supposed to have whitespace inside, eg we shouldn't have the space in "new DoubleBarrelLRUCache<FieldAndTerm, TermStateBase>(termsCacheSize);" I think the TQ ctor that takes both docFreq and states can drop the docFreq? Ie it can ask the states for it?
          Hide
          Uwe Schindler added a comment - - edited

          I have also some things:

          • We currently don't support seeking a FilteredTermsEnum, this is disallowed by UnsupportedOperationException (we may change this, but its complicated, Robert and me are thinking about it, but for now its disallowed, as it would break the enum logic). So the TermState seek method in FilteredTermsEnum should also throw UOE:
            /** This enum does not support seeking!
             * @throws UnsupportedOperationException
             */
            @Override
            public SeekStatus seek(BytesRef term, boolean useCache) throws IOException {
              throw new UnsupportedOperationException(getClass().getName()+" does not support seeking");
            }
            
          • Additionally, can the next() implementation in FilteredTermsEnum use TermState? It does lots of seeking on the underlying (filtered) TermsEnum. This is the reason why sekking on the FilteredTermsEnum is not allowed. Filtering is done here on the accept() methods.
          • For what is setNextReader in TermCollector? I don't like that, but you seems to need it for the PerReaderTermState. The collector should really only work on the enum not on any reader. At least the

          Thats what I have seen on first patch review, will now apply patch and look closer into it But the first point is important, FilteredTermsEnum currently should not support seeking.

          Show
          Uwe Schindler added a comment - - edited I have also some things: We currently don't support seeking a FilteredTermsEnum, this is disallowed by UnsupportedOperationException (we may change this, but its complicated, Robert and me are thinking about it, but for now its disallowed, as it would break the enum logic). So the TermState seek method in FilteredTermsEnum should also throw UOE: /** This enum does not support seeking! * @ throws UnsupportedOperationException */ @Override public SeekStatus seek(BytesRef term, boolean useCache) throws IOException { throw new UnsupportedOperationException(getClass().getName()+ " does not support seeking" ); } Additionally, can the next() implementation in FilteredTermsEnum use TermState? It does lots of seeking on the underlying (filtered) TermsEnum. This is the reason why sekking on the FilteredTermsEnum is not allowed. Filtering is done here on the accept() methods. For what is setNextReader in TermCollector? I don't like that, but you seems to need it for the PerReaderTermState. The collector should really only work on the enum not on any reader. At least the Thats what I have seen on first patch review, will now apply patch and look closer into it But the first point is important, FilteredTermsEnum currently should not support seeking.
          Hide
          Uwe Schindler added a comment -

          Here just the patch for a correct behaving FilteredTermsEnum (according to docs, that it does currently not support seeking). The assert is also not needed, as tenum is guranteed to be not null (its final and ctor already asserts this)

          Show
          Uwe Schindler added a comment - Here just the patch for a correct behaving FilteredTermsEnum (according to docs, that it does currently not support seeking). The assert is also not needed, as tenum is guranteed to be not null (its final and ctor already asserts this)
          Hide
          Simon Willnauer added a comment -

          I think we should remove TermsEnum.docFreq and .ord? Ie replace
          with .termState().docFreq() and .ord()?

          I disagree on that - at least docFreq() is an essential part of the API and we should not force TermState creation just to get the df. Yet, TermState is an expert API you should not need to pull an expert API to get something essential like df.
          I would leave those as they are or only pull ord into TermState.

          Maybe rename TermStateBase -> PrefixCodedTermState? Ie this is
          really the TermState impl used by any codec using
          PrefixCodedTerms? EG the fact that it stores the filePointer into
          a _X.tis file is particular to it..

          Yeah that sounds reasonable.

          Maybe rename MockTermState -> BasicTermState? At first I was
          thinking the codec should return null if it cannot seek by
          TermState... (I generally don't like mock returns that hide/lose
          information...) but then it's convenient to always have something
          to hold the docFreq for the term to avoid lots of special cased
          code... so I think it's OK?

          I think we can get rid of it entirely. We can use TermStateBase for it and let PrefixCodedTermState just add the pointer though. That way we get rid of it nicely. I would like to keep that api as it is since it makes the usage easier especially in the rewrite methods..

          We lost the "clone using new" in StandardTermState...

          I don't get that really - IMO this is quite minor but I will look into it again...

          Maybe revert changes to AppendingCodec? (Ie let it pass its terms
          dict cache size again)

          unintentional - will fix

          wonder if we can somehow make PerReaderTermState use an array
          (keyed by sub reader index) instead... seems like a new HashMap
          per Term in an MTQ could be heavy. It's tricky because we don't
          store enough information (ie to quickly map parent reader + sub
          reader -> sub index). But I don't think this should hold up
          committing... since our defaults don't typically allow for that
          many terms in-flight it should be fine...

          I actually had this in a very similar way. I used a custom linked list and relied on the fact that the incoming reader are applied in the same order and skipped until the next reader with that term appeared. I changed that back to Map impl to make it simpler since I didn't see speedups - well this was caused by a very nifty coding error

          i think I have that patch around somewhere is the history... lets see..

          I think the TQ ctor that takes both docFreq and states can drop the docFreq? Ie it can ask the states for it?

          yeah sure - well the patch is my current state since I had to drop everything and leave on friday... I clean up an upload a new patch early this week

          @Uwe: I will incorporate your fix - thanks

          Show
          Simon Willnauer added a comment - I think we should remove TermsEnum.docFreq and .ord? Ie replace with .termState().docFreq() and .ord()? I disagree on that - at least docFreq() is an essential part of the API and we should not force TermState creation just to get the df. Yet, TermState is an expert API you should not need to pull an expert API to get something essential like df. I would leave those as they are or only pull ord into TermState. Maybe rename TermStateBase -> PrefixCodedTermState? Ie this is really the TermState impl used by any codec using PrefixCodedTerms? EG the fact that it stores the filePointer into a _X.tis file is particular to it.. Yeah that sounds reasonable. Maybe rename MockTermState -> BasicTermState? At first I was thinking the codec should return null if it cannot seek by TermState... (I generally don't like mock returns that hide/lose information...) but then it's convenient to always have something to hold the docFreq for the term to avoid lots of special cased code... so I think it's OK? I think we can get rid of it entirely. We can use TermStateBase for it and let PrefixCodedTermState just add the pointer though. That way we get rid of it nicely. I would like to keep that api as it is since it makes the usage easier especially in the rewrite methods.. We lost the "clone using new" in StandardTermState... I don't get that really - IMO this is quite minor but I will look into it again... Maybe revert changes to AppendingCodec? (Ie let it pass its terms dict cache size again) unintentional - will fix wonder if we can somehow make PerReaderTermState use an array (keyed by sub reader index) instead... seems like a new HashMap per Term in an MTQ could be heavy. It's tricky because we don't store enough information (ie to quickly map parent reader + sub reader -> sub index). But I don't think this should hold up committing... since our defaults don't typically allow for that many terms in-flight it should be fine... I actually had this in a very similar way. I used a custom linked list and relied on the fact that the incoming reader are applied in the same order and skipped until the next reader with that term appeared. I changed that back to Map impl to make it simpler since I didn't see speedups - well this was caused by a very nifty coding error i think I have that patch around somewhere is the history... lets see.. I think the TQ ctor that takes both docFreq and states can drop the docFreq? Ie it can ask the states for it? yeah sure - well the patch is my current state since I had to drop everything and leave on friday... I clean up an upload a new patch early this week @Uwe: I will incorporate your fix - thanks
          Hide
          Robert Muir added a comment -

          I would leave those as they are or only pull ord into TermState.

          I agree about docfreq, but ord should go. Ord should be in e.g. StandardTermState, but not in TermState nor TermsEnum.

          This is an implementation detail for our current terms dictionary, and its silly how we just throw UOE for other codecs:
          Its codec-specific. Other terms implementations might have "something like an ord" but it definitely might not even be a long!

          Within StandardCodec etc this creates no problem as it still has access to it. If we find ourselves wanting/needing to use ord
          outside of Standard we should ask ourselves why this is and instead fix the APIs to not depend on some codec-specific long value.

          Show
          Robert Muir added a comment - I would leave those as they are or only pull ord into TermState. I agree about docfreq, but ord should go. Ord should be in e.g. StandardTermState, but not in TermState nor TermsEnum. This is an implementation detail for our current terms dictionary, and its silly how we just throw UOE for other codecs: Its codec-specific. Other terms implementations might have "something like an ord" but it definitely might not even be a long! Within StandardCodec etc this creates no problem as it still has access to it. If we find ourselves wanting/needing to use ord outside of Standard we should ask ourselves why this is and instead fix the APIs to not depend on some codec-specific long value.
          Hide
          Simon Willnauer added a comment -

          Here are some numbers for the latest patch with 10M wiki index (commitpoint: delmulti) and all MTQ rewriting to ScoreBoolean:

                         Query    QPS base    QPS spec  Pct diff
              +nebraska +state       42.96       42.17     -1.8%
                    unit state        3.63        3.61     -0.4%
                  "unit state"        1.72        1.71     -0.3%
                         state       10.55       10.54     -0.1%
          spanNear([unit, state], 10, true)        0.96        0.96      0.1%
                  +unit +state        4.03        4.04      0.2%
            spanFirst(unit, 5)        4.83        4.86      0.7%
                    united~1.0        4.76        4.86      2.1%
                      unit~1.0        2.62        2.69      2.7%
                    united~2.0        0.82        0.84      2.8%
                      unit~2.0        0.34        0.37      8.2%
                          un*d        3.55        4.14     16.6%
                          uni*        0.52        0.61     18.1%
                           u*d        0.47        0.57     19.9%
                         unit*        2.04        2.52     23.8%
          
          Show
          Simon Willnauer added a comment - Here are some numbers for the latest patch with 10M wiki index (commitpoint: delmulti) and all MTQ rewriting to ScoreBoolean: Query QPS base QPS spec Pct diff +nebraska +state 42.96 42.17 -1.8% unit state 3.63 3.61 -0.4% "unit state" 1.72 1.71 -0.3% state 10.55 10.54 -0.1% spanNear([unit, state], 10, true ) 0.96 0.96 0.1% +unit +state 4.03 4.04 0.2% spanFirst(unit, 5) 4.83 4.86 0.7% united~1.0 4.76 4.86 2.1% unit~1.0 2.62 2.69 2.7% united~2.0 0.82 0.84 2.8% unit~2.0 0.34 0.37 8.2% un*d 3.55 4.14 16.6% uni* 0.52 0.61 18.1% u*d 0.47 0.57 19.9% unit* 2.04 2.52 23.8%
          Hide
          Michael McCandless added a comment -

          I disagree on that - at least docFreq() is an essential part of the API and we should not force TermState creation just to get the df. Yet, TermState is an expert API you should not need to pull an expert API to get something essential like df.
          I would leave those as they are or only pull ord into TermState.

          OK I agree, let's leave at least dF directly in TermsEnum.

          Calling .termState presumably entails a clone right? Ie the returned object is guaranteed private? So that's a good reason not to require it...

          I actually had this in a very similar way. I used a custom linked list and relied on the fact that the incoming reader are applied in the same order and skipped until the next reader with that term appeared. I changed that back to Map impl to make it simpler since I didn't see speedups - well this was caused by a very nifty coding error

          Let's just stick w/ map for now I think? Progress not perfection!

          Show
          Michael McCandless added a comment - I disagree on that - at least docFreq() is an essential part of the API and we should not force TermState creation just to get the df. Yet, TermState is an expert API you should not need to pull an expert API to get something essential like df. I would leave those as they are or only pull ord into TermState. OK I agree, let's leave at least dF directly in TermsEnum. Calling .termState presumably entails a clone right? Ie the returned object is guaranteed private? So that's a good reason not to require it... I actually had this in a very similar way. I used a custom linked list and relied on the fact that the incoming reader are applied in the same order and skipped until the next reader with that term appeared. I changed that back to Map impl to make it simpler since I didn't see speedups - well this was caused by a very nifty coding error Let's just stick w/ map for now I think? Progress not perfection!
          Hide
          Simon Willnauer added a comment -

          here is a new patch with a slightly different implementation of PerReaderTermState. I build a view from the subreader used to build the MTQ which is shared across all PerReaderTermState instance for the query. The PrTS then uses only the ordinal from the ReaderView to reference a TermState which prevents us from creating Map instances for each term. In turn this also made it possible to fall back to re-seeking the TermDict if the reader is not in the view.

          I fixed all other issues and all tests including the highlighter pass now.

          Show
          Simon Willnauer added a comment - here is a new patch with a slightly different implementation of PerReaderTermState. I build a view from the subreader used to build the MTQ which is shared across all PerReaderTermState instance for the query. The PrTS then uses only the ordinal from the ReaderView to reference a TermState which prevents us from creating Map instances for each term. In turn this also made it possible to fall back to re-seeking the TermDict if the reader is not in the view. I fixed all other issues and all tests including the highlighter pass now.
          Hide
          Simon Willnauer added a comment -

          mike - do you mind if I take this?

          simon

          Show
          Simon Willnauer added a comment - mike - do you mind if I take this? simon
          Hide
          Michael McCandless added a comment -

          mike - do you mind if I take this?

          Of course not! Please take it

          Show
          Michael McCandless added a comment - mike - do you mind if I take this? Of course not! Please take it
          Hide
          Michael McCandless added a comment -

          Patch looks awesome!

          Really, ord() support is a function of the terms dict impl, not the
          codec (well, indirectly codec has ord() support if its terms dict impl
          does). PrefixCodedTermsDict, in turn, supports ord() only if its
          terms index does.

          Can we move TermStateBase (now under codecs.standard) up into codecs
          and rename it to PrefixCodedTermState? Ie, it's awkward that
          PrefixCodedTermsReader (a terms dict impl shared across many codecs)
          is reaching into standard codec to get its TermState impl. Then, the
          private static class in StandardPostingsReader can be renamed to
          StandardTermState?

          I like this new ReaderView! I think it can be more generally useful
          outside of PerReaderTermState, eg Filter/Collector could receive this
          so that they can map sub reader to context in parent. But let's leave
          that for another day.

          Still some small whitespace issues, eg if(

          Show
          Michael McCandless added a comment - Patch looks awesome! Really, ord() support is a function of the terms dict impl, not the codec (well, indirectly codec has ord() support if its terms dict impl does). PrefixCodedTermsDict, in turn, supports ord() only if its terms index does. Can we move TermStateBase (now under codecs.standard) up into codecs and rename it to PrefixCodedTermState? Ie, it's awkward that PrefixCodedTermsReader (a terms dict impl shared across many codecs) is reaching into standard codec to get its TermState impl. Then, the private static class in StandardPostingsReader can be renamed to StandardTermState? I like this new ReaderView! I think it can be more generally useful outside of PerReaderTermState, eg Filter/Collector could receive this so that they can map sub reader to context in parent. But let's leave that for another day. Still some small whitespace issues, eg if(
          Hide
          Michael McCandless added a comment -

          OK I tested perf on 10 M wiki index, multi-segment no deletes. For the test I [unnaturally] forced Prefix & Wildcard queries to always use scoring BQ rewrite (and upped the BQ max clause count way high) to force testing of TermState.

          Query QPS mmap QPS mmap Pct diff
          "unit state"~3 5.29 4.96 -6.3%
          unit state 11.70 11.21 -4.2%
          "unit state" 7.80 7.71 -1.2%
          spanNear([unit, state], 10, true) 4.58 4.53 -1.1%
          state 29.42 29.39 -0.1%
          unit~2.0 9.90 9.91 0.2%
          doctimesecnum:[10000 TO 60000] 9.52 9.55 0.3%
          +unit +state 11.04 11.09 0.4%
          unit~1.0 10.11 10.19 0.7%
          united~2.0 3.34 3.36 0.8%
          spanFirst(unit, 5) 16.71 16.93 1.3%
          +nebraska +state 195.03 198.25 1.7%
          united~1.0 15.78 16.11 2.1%
          un*d 12.59 29.45 133.9%
          unit* 6.87 16.54 140.7%
          u*d 2.39 6.66 178.2%
          uni* 1.82 5.29 190.6%

          Awesome speedups!

          Show
          Michael McCandless added a comment - OK I tested perf on 10 M wiki index, multi-segment no deletes. For the test I [unnaturally] forced Prefix & Wildcard queries to always use scoring BQ rewrite (and upped the BQ max clause count way high) to force testing of TermState. Query QPS mmap QPS mmap Pct diff "unit state"~3 5.29 4.96 -6.3% unit state 11.70 11.21 -4.2% "unit state" 7.80 7.71 -1.2% spanNear( [unit, state] , 10, true) 4.58 4.53 -1.1% state 29.42 29.39 -0.1% unit~2.0 9.90 9.91 0.2% doctimesecnum: [10000 TO 60000] 9.52 9.55 0.3% +unit +state 11.04 11.09 0.4% unit~1.0 10.11 10.19 0.7% united~2.0 3.34 3.36 0.8% spanFirst(unit, 5) 16.71 16.93 1.3% +nebraska +state 195.03 198.25 1.7% united~1.0 15.78 16.11 2.1% un*d 12.59 29.45 133.9% unit* 6.87 16.54 140.7% u*d 2.39 6.66 178.2% uni* 1.82 5.29 190.6% Awesome speedups!
          Hide
          Simon Willnauer added a comment -

          Can we move TermStateBase (now under codecs.standard) up into codecs
          and rename it to PrefixCodedTermState? Ie, it's awkward that
          PrefixCodedTermsReader (a terms dict impl shared across many codecs)
          is reaching into standard codec to get its TermState impl. Then, the
          private static class in StandardPostingsReader can be renamed to
          StandardTermState?

          done - that makes sense

          Still some small whitespace issues, eg if(

          done

          I think we are close

          Show
          Simon Willnauer added a comment - Can we move TermStateBase (now under codecs.standard) up into codecs and rename it to PrefixCodedTermState? Ie, it's awkward that PrefixCodedTermsReader (a terms dict impl shared across many codecs) is reaching into standard codec to get its TermState impl. Then, the private static class in StandardPostingsReader can be renamed to StandardTermState? done - that makes sense Still some small whitespace issues, eg if( done I think we are close
          Hide
          Michael McCandless added a comment -

          I think instead of ReaderView we could change Weight.scorer API so that instead of receiving IndexReader reader, it receives a struct that has parent reader, sub reader, ord of that sub?

          It's easy to be back compat because we could just forward to prior scorer method with only the sub?

          Show
          Michael McCandless added a comment - I think instead of ReaderView we could change Weight.scorer API so that instead of receiving IndexReader reader, it receives a struct that has parent reader, sub reader, ord of that sub? It's easy to be back compat because we could just forward to prior scorer method with only the sub?
          Hide
          Simon Willnauer added a comment -

          I think instead of ReaderView we could change Weight.scorer API so that instead of receiving IndexReader reader, it receives a struct that has parent reader, sub reader, ord of that sub?
          It's easy to be back compat because we could just forward to prior scorer method with only the sub?

          Mike I am not sure if that helps us here. If you use this method you can not disambiguate between the set of readers that where used to create the PerReaderTermState and the once that have a certain ord assigned to it. Disambiguation would be more difficult if we do that. IMO sharing a ReaderView seems to be the best solution so far. I don't think we should bind it to an IR directly since users can easily build a ReaderView from a Composite Reader. Yet, for searching it would be nice to have a ReaderView on Seacher / IndexSearcher which can be triggered upon weight creation.
          That way we can also disambiguate between PerReaderTermState given to the TermQuery ctor when we create the weight so that if the view doesn' t match we either create a new PerReaderTermState or just don't use it for this weight.

          I thought about TermsEnum#ord() again. I don' t think we should really add it back though. Its really an implementation detail and folks that wanna use it should be aware of that and cast correctly. On the other hand I don't like to have the seek(ord) in TermsEnum either if we remove #ord(). I think we should remove it from the interface entirely though.

          simon

          Show
          Simon Willnauer added a comment - I think instead of ReaderView we could change Weight.scorer API so that instead of receiving IndexReader reader, it receives a struct that has parent reader, sub reader, ord of that sub? It's easy to be back compat because we could just forward to prior scorer method with only the sub? Mike I am not sure if that helps us here. If you use this method you can not disambiguate between the set of readers that where used to create the PerReaderTermState and the once that have a certain ord assigned to it. Disambiguation would be more difficult if we do that. IMO sharing a ReaderView seems to be the best solution so far. I don't think we should bind it to an IR directly since users can easily build a ReaderView from a Composite Reader. Yet, for searching it would be nice to have a ReaderView on Seacher / IndexSearcher which can be triggered upon weight creation. That way we can also disambiguate between PerReaderTermState given to the TermQuery ctor when we create the weight so that if the view doesn' t match we either create a new PerReaderTermState or just don't use it for this weight. I thought about TermsEnum#ord() again. I don' t think we should really add it back though. Its really an implementation detail and folks that wanna use it should be aware of that and cast correctly. On the other hand I don't like to have the seek(ord) in TermsEnum either if we remove #ord(). I think we should remove it from the interface entirely though. simon
          Hide
          Simon Willnauer added a comment -

          I think instead of ReaderView we could change Weight.scorer API so that instead of receiving IndexReader reader, it receives a struct that has parent reader, sub reader, ord of that sub?

          so I changed the Weight#scorer API to use a class called ScoreContext that holds the parent reader, the current sub and the subs ord. That change is absolutely massive! I don't upload that change since I think if we do that we need to do it in a different issue anyway. I ran into a couple of problems:

          • if we pass in such a context we also need to change the explain interface since its calling Weight#scorer here and there
          • Once we pass in the Context stuff like QueryWrapperFilter doesn't work anymore since it doesn't know which ord the incoming reader has. So Filters would need a context too. I don't like that!
          • Stuff like scoreDocsInOrder are hard to put into such a context since almost all scorers internally are called with scoreDocsInOrder=true with a contant. meaning that nobody really respects the incoming value for subscorers though. but if i just forward the context the member needs to be set to true or the context needs to be cloned for subs - see BooleanQuery for instance.
          • such a context would somehow enforce that MTQ are only executed against the Reader they where rewritten against. Which is how it should be IMO but we are also depending on that everybody who uses a MTQ knows exactly how the query was rewritten which is kind of not obvious. I think we need a better way to enforce stuff like that.

          after all I think this must be done in a different issue though. This issue was meant to make MTQ single pass so lets do that first.... progress over perfection ey mike .
          Nonetheless, it seems like that we need to rethink the Weight API entirely I also don't like that Weight operates on Searcher instead of IndexSearcher though.

          Show
          Simon Willnauer added a comment - I think instead of ReaderView we could change Weight.scorer API so that instead of receiving IndexReader reader, it receives a struct that has parent reader, sub reader, ord of that sub? so I changed the Weight#scorer API to use a class called ScoreContext that holds the parent reader, the current sub and the subs ord. That change is absolutely massive! I don't upload that change since I think if we do that we need to do it in a different issue anyway. I ran into a couple of problems: if we pass in such a context we also need to change the explain interface since its calling Weight#scorer here and there Once we pass in the Context stuff like QueryWrapperFilter doesn't work anymore since it doesn't know which ord the incoming reader has. So Filters would need a context too. I don't like that! Stuff like scoreDocsInOrder are hard to put into such a context since almost all scorers internally are called with scoreDocsInOrder=true with a contant. meaning that nobody really respects the incoming value for subscorers though. but if i just forward the context the member needs to be set to true or the context needs to be cloned for subs - see BooleanQuery for instance. such a context would somehow enforce that MTQ are only executed against the Reader they where rewritten against. Which is how it should be IMO but we are also depending on that everybody who uses a MTQ knows exactly how the query was rewritten which is kind of not obvious. I think we need a better way to enforce stuff like that. after all I think this must be done in a different issue though. This issue was meant to make MTQ single pass so lets do that first.... progress over perfection ey mike . Nonetheless, it seems like that we need to rethink the Weight API entirely I also don't like that Weight operates on Searcher instead of IndexSearcher though.
          Hide
          Michael McCandless added a comment -

          after all I think this must be done in a different issue though

          +1

          If, when we now pass a naked IndexReader (eg to Weight.scorer, Weight.explain, Filter.getDocIdSet) we replace that with a ReaderContext which has reader, its parent, and its ord, then this precursor makes both TermState (this issue) and the awesome PK speedup (LUCENE-2829) much simpler. And I agree we should break it out as its own issue. It's good to do that as its own issue since that's a rote API cutover – we are passing a struct instead of a naked reader, but otherwise no change.

          This also lets us solve cases where the Filter needs the full context, eg LUCENE-2348.

          Also, with this I think we should sharpen in the jdocs that when you call Query.rewrite the returned query must be searched only against he same reader you rewrote against. Similarly when you create a Weight, it should only be used against the same Searcher used to create it from a Query.

          Show
          Michael McCandless added a comment - after all I think this must be done in a different issue though +1 If, when we now pass a naked IndexReader (eg to Weight.scorer, Weight.explain, Filter.getDocIdSet) we replace that with a ReaderContext which has reader, its parent, and its ord, then this precursor makes both TermState (this issue) and the awesome PK speedup ( LUCENE-2829 ) much simpler. And I agree we should break it out as its own issue. It's good to do that as its own issue since that's a rote API cutover – we are passing a struct instead of a naked reader, but otherwise no change. This also lets us solve cases where the Filter needs the full context, eg LUCENE-2348 . Also, with this I think we should sharpen in the jdocs that when you call Query.rewrite the returned query must be searched only against he same reader you rewrote against. Similarly when you create a Weight, it should only be used against the same Searcher used to create it from a Query.
          Hide
          Simon Willnauer added a comment -

          Another iteration on this after LUCENE-2831 was committed last week.

          • updated to trunk & all test pass
          • re-added all ord() related stuff back to TermsEnum since I think we should decouple this and solve it in a different issue. There is already enough changes in here and discussions should be focused on making MTQ single pass.
          • Changed IndexSearcher to run concurrent searches on a "leaf slice" rather than on a leaf converted to a Top-Level Context. That made the callables a bit simpler and is more consistent since the hierarchy is preserved.
          • TermState is now referenced by leaf ordinal and asserted using the leaf's top-level ctx.
          • TermQuery is not single pass for all queries while state is only hold in Weight unless PerReaderTermState as not set. But even then the top-level ctx must be identical to the given IS's top-level ctx otherwise the give PerReaderTermState is not used.

          this one seems pretty close

          Show
          Simon Willnauer added a comment - Another iteration on this after LUCENE-2831 was committed last week. updated to trunk & all test pass re-added all ord() related stuff back to TermsEnum since I think we should decouple this and solve it in a different issue. There is already enough changes in here and discussions should be focused on making MTQ single pass. Changed IndexSearcher to run concurrent searches on a "leaf slice" rather than on a leaf converted to a Top-Level Context. That made the callables a bit simpler and is more consistent since the hierarchy is preserved. TermState is now referenced by leaf ordinal and asserted using the leaf's top-level ctx. TermQuery is not single pass for all queries while state is only hold in Weight unless PerReaderTermState as not set. But even then the top-level ctx must be identical to the given IS's top-level ctx otherwise the give PerReaderTermState is not used. this one seems pretty close
          Hide
          Robert Muir added a comment -

          One question, I'm look at the definition of TermState:

          Holds all state required for {@link TermsEnum} to produce a {@link DocsEnum} without re-seeking the terms dict.
          

          So why do we have seek(BytesRef, TermState)
          shouldnt it just be seek(Termstate) ?
          I think its confusing it takes an unnecessary bytes parameter.

          Show
          Robert Muir added a comment - One question, I'm look at the definition of TermState: Holds all state required for {@link TermsEnum} to produce a {@link DocsEnum} without re-seeking the terms dict. So why do we have seek(BytesRef, TermState) shouldnt it just be seek(Termstate) ? I think its confusing it takes an unnecessary bytes parameter.
          Hide
          Robert Muir added a comment -

          here's a hack patch (dont think it actually works) just showing what i mean.

          I think termsenum should only have seek(TermState).
          in the hack-patch, i made the termState() and seekTermState() non-abstract:
          the default impl returns a 'SimpleTermState' containing the term bytes and saved docFreq and implements seek(TermState) with those bytes.

          This is basically what the patch had everywhere anyway as an implementation (for many of these, we should use more efficient impls, i fixed this for MemoryIndex as an example, but MultiTermsEnum comes to mind).

          Also, i don't understand what was going on with setting bytes on the DeltaBytesReader with your seek(BytesRef, TermState) before.

          If StandardCodec needs to know the shared byte[] prefix or something like that to reposition the enum, then it
          should put this in its termstate.

          Show
          Robert Muir added a comment - here's a hack patch (dont think it actually works) just showing what i mean. I think termsenum should only have seek(TermState). in the hack-patch, i made the termState() and seekTermState() non-abstract: the default impl returns a 'SimpleTermState' containing the term bytes and saved docFreq and implements seek(TermState) with those bytes. This is basically what the patch had everywhere anyway as an implementation (for many of these, we should use more efficient impls, i fixed this for MemoryIndex as an example, but MultiTermsEnum comes to mind). Also, i don't understand what was going on with setting bytes on the DeltaBytesReader with your seek(BytesRef, TermState) before. If StandardCodec needs to know the shared byte[] prefix or something like that to reposition the enum, then it should put this in its termstate.
          Hide
          Simon Willnauer added a comment -

          Next iteration. I took roberts patch and cleaned up a few things and added a new OrdTermState that can be used for instanceof testing and thoughout all TermsEnum that use ord primarily. I also removed the docFreq() getter from TermState since its really an impl. detail. The downside of this patch is that PrefixCodedTermState is kind of heavyweight now since it carries the BytesRef to re-init the DeltaBytesReader but I didn't see another way to fix this right now.

          Show
          Simon Willnauer added a comment - Next iteration. I took roberts patch and cleaned up a few things and added a new OrdTermState that can be used for instanceof testing and thoughout all TermsEnum that use ord primarily. I also removed the docFreq() getter from TermState since its really an impl. detail. The downside of this patch is that PrefixCodedTermState is kind of heavyweight now since it carries the BytesRef to re-init the DeltaBytesReader but I didn't see another way to fix this right now.
          Hide
          Michael McCandless added a comment -

          The failure in TestFSTs is because PrefixCodedTermsReader is somehow returning an OrdTermState when its terms index (var gap) does not support ord.

          The ord member of PrefixCodedTermState is undefined when the terms dict doesn't support ord (ie when ord() throws UOE).

          So to fix this we should fix TestFSTs to go back to calling .ord() and catching the UOE, maybe? Separately, make sure you don't overwrite storeOrds in that test. Ie the test randomly sets it to true or false (so that we test both cases); only if the terms index cannot suppord ord should we wire it to false. If it can support ord then we should leave it as the random value...

          Show
          Michael McCandless added a comment - The failure in TestFSTs is because PrefixCodedTermsReader is somehow returning an OrdTermState when its terms index (var gap) does not support ord. The ord member of PrefixCodedTermState is undefined when the terms dict doesn't support ord (ie when ord() throws UOE). So to fix this we should fix TestFSTs to go back to calling .ord() and catching the UOE, maybe? Separately, make sure you don't overwrite storeOrds in that test. Ie the test randomly sets it to true or false (so that we test both cases); only if the terms index cannot suppord ord should we wire it to false. If it can support ord then we should leave it as the random value...
          Hide
          Simon Willnauer added a comment -

          fixed the TestFST - thanks mike for looking into that and updated to trunk

          Show
          Simon Willnauer added a comment - fixed the TestFST - thanks mike for looking into that and updated to trunk
          Hide
          Simon Willnauer added a comment -

          This patch changes TermsEnum#seek(TermState) back to TermsEnum#seek(BytesRef, TermState). Yet, TermState is opaque now and TermsEnum has a default impl for TermsEnum#seek(BytesRef, TermState). Holding the BytesRef in TermState for our PrefixCoded* based codecs seems way too costly though. seems like this time perf rules out purity in the interface.

          Show
          Simon Willnauer added a comment - This patch changes TermsEnum#seek(TermState) back to TermsEnum#seek(BytesRef, TermState). Yet, TermState is opaque now and TermsEnum has a default impl for TermsEnum#seek(BytesRef, TermState). Holding the BytesRef in TermState for our PrefixCoded* based codecs seems way too costly though. seems like this time perf rules out purity in the interface.
          Hide
          Robert Muir added a comment -

          seems like this time perf rules out purity in the interface.

          I know i didn't like this aspect of the patch, but I am ok with it for now as long as we keep things experimental and try to keep an eye on improving the 'purity' of TermsEnum a bit.
          we are making a lot of progress on the terms handling with flexible indexing and i could easily see more interesting implementations being available other than just PrefixCoded...
          In some ideal world I guess i'd prefer if TermsEnum was an attributesource with seek() and next(), FilteredTermsEnum was like tokenFilter, and TermState was just captureState/restoreState...
          but I agree we should just lean towards whatever works for now.

          definitely like it better now that things such as docFreq() are pulled out of termstate and its completely opaque, i think this is the right way to go.

          Show
          Robert Muir added a comment - seems like this time perf rules out purity in the interface. I know i didn't like this aspect of the patch, but I am ok with it for now as long as we keep things experimental and try to keep an eye on improving the 'purity' of TermsEnum a bit. we are making a lot of progress on the terms handling with flexible indexing and i could easily see more interesting implementations being available other than just PrefixCoded... In some ideal world I guess i'd prefer if TermsEnum was an attributesource with seek() and next(), FilteredTermsEnum was like tokenFilter, and TermState was just captureState/restoreState... but I agree we should just lean towards whatever works for now. definitely like it better now that things such as docFreq() are pulled out of termstate and its completely opaque, i think this is the right way to go.
          Hide
          Simon Willnauer added a comment -

          Added Changes.txt entry and fixed the remaining JavaDoc on TermState.

          My latest benchmark results with that patch are here:

          
                    unit state        3.81        3.70     -2.9%
              +nebraska +state       41.26       40.61     -1.6%
                  +unit +state        3.95        3.90     -1.1%
            spanFirst(unit, 5)        4.55        4.51     -0.9%
                         state       10.11       10.07     -0.3%
                "unit state"~3        0.98        0.98     -0.2%
                  "unit state"        1.49        1.49     -0.0%
                    united~1.0        3.66        3.72      1.5%
                      unit~1.0        2.33        2.37      1.6%
                    united~2.0        0.81        0.83      2.7%
                      unit~2.0        0.35        0.38     10.1%
                           u*d        0.52        0.67     29.5%
          doctitle:.*[Uu]nited.*        0.19        0.25     31.6%
                          un*d        3.59        4.77     33.0%
                          uni*        0.56        0.75     34.9%
                         unit*        2.20        3.15     43.3%
          

          I think we are ready to go - I will commit later today if nobody objects

          Show
          Simon Willnauer added a comment - Added Changes.txt entry and fixed the remaining JavaDoc on TermState. My latest benchmark results with that patch are here: unit state 3.81 3.70 -2.9% +nebraska +state 41.26 40.61 -1.6% +unit +state 3.95 3.90 -1.1% spanFirst(unit, 5) 4.55 4.51 -0.9% state 10.11 10.07 -0.3% "unit state" ~3 0.98 0.98 -0.2% "unit state" 1.49 1.49 -0.0% united~1.0 3.66 3.72 1.5% unit~1.0 2.33 2.37 1.6% united~2.0 0.81 0.83 2.7% unit~2.0 0.35 0.38 10.1% u*d 0.52 0.67 29.5% doctitle:.*[Uu]nited.* 0.19 0.25 31.6% un*d 3.59 4.77 33.0% uni* 0.56 0.75 34.9% unit* 2.20 3.15 43.3% I think we are ready to go - I will commit later today if nobody objects
          Hide
          Simon Willnauer added a comment -

          I just figured that PKLookups are actually slower with this patch 164 msec for 1000 lookups (164 us per lookup) vs 144 msec for 1000 lookups (144 us per lookup) on trunk. I will dig!

          Show
          Simon Willnauer added a comment - I just figured that PKLookups are actually slower with this patch 164 msec for 1000 lookups (164 us per lookup) vs 144 msec for 1000 lookups (144 us per lookup) on trunk. I will dig!
          Hide
          Michael McCandless added a comment -

          Actually I see PK lookups faster – 23 usec w/ patch vs 33 usec w/ trunk (per lookup) for 20K lookups.

          And good speedups on many-term MTQs when I force BQ rewrite:

          Query QPS base QPS termstate Pct diff
          +nebraska +state 169.75 154.64 -8.9%
          doctitle:.[Uu]nited. 4.26 4.11 -3.5%
          +unit +state 11.40 11.09 -2.7%
          spanFirst(unit, 5) 17.38 16.93 -2.6%
          spanNear([unit, state], 10, true) 4.37 4.32 -1.2%
          "unit state"~3 4.94 4.89 -1.0%
          "unit state" 8.05 8.03 -0.2%
          state 26.58 26.76 0.7%
          unit state 11.24 11.46 1.9%
          united~2.0 3.87 3.98 2.8%
          doctimesecnum:[10000 TO 60000] 8.26 8.70 5.3%
          unit~2.0 10.04 10.59 5.4%
          united~1.0 16.84 18.13 7.7%
          unit~1.0 10.09 10.99 8.9%
          un*d 11.96 21.63 80.8%
          unit* 7.60 14.23 87.3%
          u*d 2.22 4.17 87.8%
          uni* 1.83 3.53 93.7%

          +1 to commit!

          Show
          Michael McCandless added a comment - Actually I see PK lookups faster – 23 usec w/ patch vs 33 usec w/ trunk (per lookup) for 20K lookups. And good speedups on many-term MTQs when I force BQ rewrite: Query QPS base QPS termstate Pct diff +nebraska +state 169.75 154.64 -8.9% doctitle:. [Uu] nited. 4.26 4.11 -3.5% +unit +state 11.40 11.09 -2.7% spanFirst(unit, 5) 17.38 16.93 -2.6% spanNear( [unit, state] , 10, true) 4.37 4.32 -1.2% "unit state"~3 4.94 4.89 -1.0% "unit state" 8.05 8.03 -0.2% state 26.58 26.76 0.7% unit state 11.24 11.46 1.9% united~2.0 3.87 3.98 2.8% doctimesecnum: [10000 TO 60000] 8.26 8.70 5.3% unit~2.0 10.04 10.59 5.4% united~1.0 16.84 18.13 7.7% unit~1.0 10.09 10.99 8.9% un*d 11.96 21.63 80.8% unit* 7.60 14.23 87.3% u*d 2.22 4.17 87.8% uni* 1.83 3.53 93.7% +1 to commit!
          Hide
          Simon Willnauer added a comment -

          Here is a final patch, I opened up Terms#getThreadTermsEnum() to reuse TermsEnum in PRTE#build().
          PRTE#build() now also accepts a boolean if the termlookup should be cached or not which makes sense for common TermQuery.

          I will commit that shortly - yay!

          Show
          Simon Willnauer added a comment - Here is a final patch, I opened up Terms#getThreadTermsEnum() to reuse TermsEnum in PRTE#build(). PRTE#build() now also accepts a boolean if the termlookup should be cached or not which makes sense for common TermQuery. I will commit that shortly - yay!
          Hide
          Simon Willnauer added a comment -

          Actually I see PK lookups faster - 23 usec w/ patch vs 33 usec w/ trunk (per lookup) for 20K lookups.

          so I run that on a 32bit machine which is quite slow in general though. I will further investigate that on 32bit platform vs. 64 bit. Yet, I only used 1k lookups though.

          Show
          Simon Willnauer added a comment - Actually I see PK lookups faster - 23 usec w/ patch vs 33 usec w/ trunk (per lookup) for 20K lookups. so I run that on a 32bit machine which is quite slow in general though. I will further investigate that on 32bit platform vs. 64 bit. Yet, I only used 1k lookups though.
          Hide
          Simon Willnauer added a comment -

          Committed revision 1058328.

          Show
          Simon Willnauer added a comment - Committed revision 1058328.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development