Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, Trunk
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Having upper/lower bounds on terms could be useful for various optimizations in the future, e.g. to accelerate sorting (if a segment can't compete, don't even search it), and so on.

      Its pretty obvious how to get the smallest term, but the maximum term for a field is tricky, but worst case you can do it in ~ log(N) time by binary searching term space.

      1. LUCENE-5610.patch
        30 kB
        Michael McCandless
      2. LUCENE-5610.patch
        28 kB
        Michael McCandless
      3. LUCENE-5610.patch
        20 kB
        Michael McCandless
      4. LUCENE-5610.patch
        10 kB
        Michael McCandless
      5. LUCENE-5610.patch
        6 kB
        Robert Muir

        Activity

        Hide
        ASF subversion and git services added a comment -

        Commit 1613303 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1613303 ]

        LUCENE-5610: optimization: just use already allocated/copied PendingTerm to for min/maxTerm

        Show
        ASF subversion and git services added a comment - Commit 1613303 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1613303 ] LUCENE-5610 : optimization: just use already allocated/copied PendingTerm to for min/maxTerm
        Hide
        ASF subversion and git services added a comment -

        Commit 1613268 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1613268 ]

        LUCENE-5610: optimization: just use already allocated/copied PendingTerm to for min/maxTerm

        Show
        ASF subversion and git services added a comment - Commit 1613268 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1613268 ] LUCENE-5610 : optimization: just use already allocated/copied PendingTerm to for min/maxTerm
        Hide
        ASF subversion and git services added a comment -

        Commit 1589782 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1589782 ]

        LUCENE-5610: don't use Lucene3x codec (the test writes arbitrary binary terms)

        Show
        ASF subversion and git services added a comment - Commit 1589782 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1589782 ] LUCENE-5610 : don't use Lucene3x codec (the test writes arbitrary binary terms)
        Hide
        ASF subversion and git services added a comment -

        Commit 1589752 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1589752 ]

        LUCENE-5610: improve CheckIndex checking; javadocs

        Show
        ASF subversion and git services added a comment - Commit 1589752 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1589752 ] LUCENE-5610 : improve CheckIndex checking; javadocs
        Hide
        ASF subversion and git services added a comment -

        Commit 1589749 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1589749 ]

        LUCENE-5610: add Terms.getMin/Max

        Show
        ASF subversion and git services added a comment - Commit 1589749 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1589749 ] LUCENE-5610 : add Terms.getMin/Max
        Hide
        ASF subversion and git services added a comment -

        Commit 1589729 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1589729 ]

        LUCENE-5610: add Terms.getMin/Max

        Show
        ASF subversion and git services added a comment - Commit 1589729 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1589729 ] LUCENE-5610 : add Terms.getMin/Max
        Hide
        Michael McCandless added a comment -

        Thanks Rob, good point about the BT docs, I fixed that, and cleaned up the writing/reading of the min/maxTerm a bit ... I think it's ready. I'll commit soon.

        Show
        Michael McCandless added a comment - Thanks Rob, good point about the BT docs, I fixed that, and cleaned up the writing/reading of the min/maxTerm a bit ... I think it's ready. I'll commit soon.
        Hide
        Robert Muir added a comment -

        Looks good. Thanks for making the binarysearch cleaner

        Can we update fileformat docs for blocktree before committing?

        Show
        Robert Muir added a comment - Looks good. Thanks for making the binarysearch cleaner Can we update fileformat docs for blocktree before committing?
        Hide
        Michael McCandless added a comment -

        Another iteration, fixing all nocommits. I think it's ready ...

        I moved the seekCeil impl into a private subclass of FilteredTermsEnum, only for the numeric case: it's too scary thinking about seeking a FuzzyTermsEnum!

        I also added sugar static methods to get the min/max int/long from a Terms.

        Finally, I fixed BlockTreeWriter/Reader to save the min and max term.

        Show
        Michael McCandless added a comment - Another iteration, fixing all nocommits. I think it's ready ... I moved the seekCeil impl into a private subclass of FilteredTermsEnum, only for the numeric case: it's too scary thinking about seeking a FuzzyTermsEnum! I also added sugar static methods to get the min/max int/long from a Terms. Finally, I fixed BlockTreeWriter/Reader to save the min and max term.
        Hide
        Michael McCandless added a comment -

        Another patch, I think it's closer:

        • I re-implemented Terms.getMax default impl as a binary search
          digit by digit
        • I added FilteredTermsEnum.seekCeil, but that code is tricky ...
        • Impl'd MultiTerms.getMin/Max
        • Test cases, including showing we get the right result for min/max
          of numeric fields

        I'd like to add some sugar to make the numeric case easier, e.g. maybe
        NumericUtils.min/max(Terms)?

        I think we can separately tackle a more efficient BlockTree impl for
        min/maxTerm; I think it could just store this in the index and
        fallback to super for older indices.

        Show
        Michael McCandless added a comment - Another patch, I think it's closer: I re-implemented Terms.getMax default impl as a binary search digit by digit I added FilteredTermsEnum.seekCeil, but that code is tricky ... Impl'd MultiTerms.getMin/Max Test cases, including showing we get the right result for min/max of numeric fields I'd like to add some sugar to make the numeric case easier, e.g. maybe NumericUtils.min/max(Terms)? I think we can separately tackle a more efficient BlockTree impl for min/maxTerm; I think it could just store this in the index and fallback to super for older indices.
        Hide
        Robert Muir added a comment -

        It's a long shot ... but maybe we could index the "exact" trie terms
        in a separate field than their precShifts? Then we could just take
        max of the "exact trie terms" field ... that's a big change though

        I think there is a much simpler solution though, we just dont return FilteredTermsEnum from that one method, instead a regular termsenum. it just has to look for the special byte prefix and return END.

        Then we can easily do what we want.

        Show
        Robert Muir added a comment - It's a long shot ... but maybe we could index the "exact" trie terms in a separate field than their precShifts? Then we could just take max of the "exact trie terms" field ... that's a big change though I think there is a much simpler solution though, we just dont return FilteredTermsEnum from that one method, instead a regular termsenum. it just has to look for the special byte prefix and return END. Then we can easily do what we want.
        Hide
        Michael McCandless added a comment -

        New possible patch. I switched to using BigInteger to compute the
        midpoint ... but some tests are very slow, e.g.:

        ot TestDirectoryTaxonomyWriter.testHugeLabel -seed B50345AE17F7D036

        It's a long shot ... but maybe we could index the "exact" trie terms
        in a separate field than their precShifts? Then we could just take
        max of the "exact trie terms" field ... that's a big change though

        Show
        Michael McCandless added a comment - New possible patch. I switched to using BigInteger to compute the midpoint ... but some tests are very slow, e.g.: ot TestDirectoryTaxonomyWriter.testHugeLabel -seed B50345AE17F7D036 It's a long shot ... but maybe we could index the "exact" trie terms in a separate field than their precShifts? Then we could just take max of the "exact trie terms" field ... that's a big change though
        Hide
        Robert Muir added a comment -

        Here is a prototype. I think codecs like blocktree could just override and probably get the maximum value in constant time.

        This stuff is probably most useful for numeric fields, but unfortunately they add extra terms, and their termsenums aren't seekable. So maybe NumericUtils.filterPrefixCoded* can be fixed to return "real" termsenums rather than FilteredTermsEnum?

        FilteredTermsEnum is much less useful in general, its pretty much only good for MultiTermQuery execution.

        Show
        Robert Muir added a comment - Here is a prototype. I think codecs like blocktree could just override and probably get the maximum value in constant time. This stuff is probably most useful for numeric fields, but unfortunately they add extra terms, and their termsenums aren't seekable. So maybe NumericUtils.filterPrefixCoded* can be fixed to return "real" termsenums rather than FilteredTermsEnum? FilteredTermsEnum is much less useful in general, its pretty much only good for MultiTermQuery execution.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development