Lucene - Core
  1. Lucene - Core
  2. LUCENE-3225

Optimize TermsEnum.seek when caller doesn't need next term

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Some codecs are able to save CPU if the caller is only interested in
      exact matches. EG, Memory codec and SimpleText can do more efficient
      FSTEnum lookup if they know the caller doesn't need to know the term
      following the seek term.

      We have cases like this in Lucene, eg when IW deletes documents by
      Term, if the term is not found in a given segment then it doesn't need
      to know the ceiling term. Likewise when TermQuery looks up the term
      in each segment.

      I had done this change as part of LUCENE-3030, which is a new terms
      index that's able to save seeking for exact-only lookups, but now that
      we have Memory codec that can also save CPU I think we should commit
      this today.

      The change adds a "boolean onlyExact" param to seek(BytesRef).

      1. LUCENE-3225.patch
        32 kB
        Michael McCandless
      2. LUCENE-3225.patch
        80 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch. All tests pass... I think it's ready!

        Show
        Michael McCandless added a comment - Patch. All tests pass... I think it's ready!
        Hide
        Yonik Seeley added a comment -

        +1, looks good.

        Show
        Yonik Seeley added a comment - +1, looks good.
        Hide
        Simon Willnauer added a comment -

        Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone plus its cluttering the seek method though. Once Boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which does only support exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy?

        Show
        Simon Willnauer added a comment - Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone plus its cluttering the seek method though. Once Boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which does only support exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy?
        Hide
        Michael McCandless added a comment -

        Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone plus its cluttering the seek method though. Once Boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which does only support exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here.

        Well, it only means the enum is unpositioned if you get back
        NOT_FOUND? Ie, it's just like if you get back null from next(), or
        END from seek(): in these cases, the enum is unpositioned and you need
        to call seek again.

        My worry if we force an up-front decision here ("exact only" enum vs
        "non-exact only enum") is we prevent legitimate use cases where the
        caller wants to mix & match with one enum.

        EG, when AutomatonQuery intersects w/ the terms, when it hits are
        region where terms are denser than what the automaton will accept
        (such as an "infinite" part), it should use exact seeking, but then
        when it's in a region where terms are less dense (eg a "finite" part)
        it should use exact seeking.... I'll open a separate issue for this.

        The TermsEnum impls can be efficient in this case, ie re-using
        internal seek state for the exat and non-exact cases (MemoryCodec does
        this).

        But I agree another boolean to seek isn't great; maybe instead we can
        make a seperate seekExact method? Default impl would just call seek
        (and get no perf gains).

        BTW, similarly, I think we have a missing API in DISI (for
        scoring): advance always does a next() if the target doc doesn't
        match. But we can get substantial performance gains in some cases
        (see LUCENE-1536) if we had an advanceExact that would not do the
        next and simply tell us if this doc matched or not.

        Can we somehow leave the extra CPU work to the term() call and make this entirely lazy?

        Not sure what you meant here?

        Show
        Michael McCandless added a comment - Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone plus its cluttering the seek method though. Once Boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which does only support exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here. Well, it only means the enum is unpositioned if you get back NOT_FOUND? Ie, it's just like if you get back null from next(), or END from seek(): in these cases, the enum is unpositioned and you need to call seek again. My worry if we force an up-front decision here ("exact only" enum vs "non-exact only enum") is we prevent legitimate use cases where the caller wants to mix & match with one enum. EG, when AutomatonQuery intersects w/ the terms, when it hits are region where terms are denser than what the automaton will accept (such as an "infinite" part), it should use exact seeking, but then when it's in a region where terms are less dense (eg a "finite" part) it should use exact seeking.... I'll open a separate issue for this. The TermsEnum impls can be efficient in this case, ie re-using internal seek state for the exat and non-exact cases (MemoryCodec does this). But I agree another boolean to seek isn't great; maybe instead we can make a seperate seekExact method? Default impl would just call seek (and get no perf gains). BTW, similarly, I think we have a missing API in DISI (for scoring): advance always does a next() if the target doc doesn't match. But we can get substantial performance gains in some cases (see LUCENE-1536 ) if we had an advanceExact that would not do the next and simply tell us if this doc matched or not. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy? Not sure what you meant here?
        Hide
        Michael McCandless added a comment -

        This patch gives nice gains for MemoryCodec: I did a quick test w/ my
        NRT stress test (reopen at 2X Twitter's peak indexing rate) and the
        reopen time dropped from ~49 msec to ~43 msec (~12% faster). This is
        impressive because resolving deletes is just one part of opening the
        NRT reader, ie we also must write the new segment, open SegmentReader
        against it, etc.

        Show
        Michael McCandless added a comment - This patch gives nice gains for MemoryCodec: I did a quick test w/ my NRT stress test (reopen at 2X Twitter's peak indexing rate) and the reopen time dropped from ~49 msec to ~43 msec (~12% faster). This is impressive because resolving deletes is just one part of opening the NRT reader, ie we also must write the new segment, open SegmentReader against it, etc.
        Hide
        Simon Willnauer added a comment -

        BTW, similarly, I think we have a missing API in DISI (for
        scoring): advance always does a next() if the target doc doesn't
        match. But we can get substantial performance gains in some cases
        (see LUCENE-1536) if we had an advanceExact that would not do the
        next and simply tell us if this doc matched or not.

        +1!!

        But I agree another boolean to seek isn't great; maybe instead we can
        make a seperate seekExact method? Default impl would just call seek
        (and get no perf gains).

        thats another option and I like that better though. Yet the other should the be seekFloor no?

        not sure what you meant here?

        nevermind I only looked at the top of the patch and figured that we only safe the loading into bytesref but there is more about it...

        Show
        Simon Willnauer added a comment - BTW, similarly, I think we have a missing API in DISI (for scoring): advance always does a next() if the target doc doesn't match. But we can get substantial performance gains in some cases (see LUCENE-1536 ) if we had an advanceExact that would not do the next and simply tell us if this doc matched or not. +1!! But I agree another boolean to seek isn't great; maybe instead we can make a seperate seekExact method? Default impl would just call seek (and get no perf gains). thats another option and I like that better though. Yet the other should the be seekFloor no? not sure what you meant here? nevermind I only looked at the top of the patch and figured that we only safe the loading into bytesref but there is more about it...
        Hide
        Michael McCandless added a comment -

        Yet the other should the be seekFloor no?

        Ahhh right, we had discussed on the dev list. I agree!

        But, we should do this in another issue. Though, I think we should rename the current seek to seekCeil; I'll do that here.

        Show
        Michael McCandless added a comment - Yet the other should the be seekFloor no? Ahhh right, we had discussed on the dev list. I agree! But, we should do this in another issue. Though, I think we should rename the current seek to seekCeil; I'll do that here.
        Hide
        Michael McCandless added a comment -

        OK, new patch: I added a new seekExact method (instead of new boolean to seek); renamed existing seek methods to either seekCeil or seekExact; changed seekExact(long ord) to not return a value (it's an error to pass out-of-bounds ord to this method). I think it's ready!

        Show
        Michael McCandless added a comment - OK, new patch: I added a new seekExact method (instead of new boolean to seek); renamed existing seek methods to either seekCeil or seekExact; changed seekExact(long ord) to not return a value (it's an error to pass out-of-bounds ord to this method). I think it's ready!
        Hide
        Dawid Weiss added a comment -

        I like this one better. boolean args are cryptic (even if I do use them from time to time).

        Show
        Dawid Weiss added a comment - I like this one better. boolean args are cryptic (even if I do use them from time to time).
        Hide
        Simon Willnauer added a comment -

        looks good +1 to commit! thanks for working on that

        Show
        Simon Willnauer added a comment - looks good +1 to commit! thanks for working on that

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development