Lucene - Core
  1. Lucene - Core
  2. LUCENE-2760

optimize spanfirstquery, spanpositionrangequery

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      SpanFirstQuery and SpanPositionRangeQuery (SpanFirst is just a special case of this), are currently inefficient.

      Take this worst case example: SpanFirstQuery("the").
      Currently the code reads all the positions for the term "the".

      But when enumerating spans, once we have passed the allowable range we should move on to the next document (skipTo)

      1. LUCENE-2760.patch
        8 kB
        Robert Muir
      2. LUCENE-2760.patch
        7 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        here's the patch, the SpanPositionCheckQuery now has logic similar to FilteredTermsEnum,
        instead of returning a boolean true/false for whether a match is acceptable,
        it can return YES, NO, NO_AND_ADVANCE

        Show
        Robert Muir added a comment - here's the patch, the SpanPositionCheckQuery now has logic similar to FilteredTermsEnum, instead of returning a boolean true/false for whether a match is acceptable, it can return YES, NO, NO_AND_ADVANCE
        Hide
        Robert Muir added a comment -

        Admittedly, I don't yet have a good benchmarking setup for these spanqueries yet.

        But from doing a quick test on a 125k doc corpus, the SpanFirstQuery on a common term like "the" took
        about half the time.. this is because it read/evaluated 117,556 positions instead of 1,029,622 positions.

        Show
        Robert Muir added a comment - Admittedly, I don't yet have a good benchmarking setup for these spanqueries yet. But from doing a quick test on a 125k doc corpus, the SpanFirstQuery on a common term like "the" took about half the time.. this is because it read/evaluated 117,556 positions instead of 1,029,622 positions.
        Hide
        Robert Muir added a comment -

        here's an updated patch, with javadocs.

        additionally i now check for spans.start() >= end() instead of spans.start() > end()

        i believe its invalid to have a zero-length span (e.g. for a single term end = start+1)
        I added an assert to check for this, and all tests still pass.

        Show
        Robert Muir added a comment - here's an updated patch, with javadocs. additionally i now check for spans.start() >= end() instead of spans.start() > end() i believe its invalid to have a zero-length span (e.g. for a single term end = start+1) I added an assert to check for this, and all tests still pass.
        Hide
        Robert Muir added a comment -

        I'd like to commit this soon if there are no objections.

        these apis (SpanPositionCheckQuery) are new in 3.x/trunk so theres no backwards break

        Show
        Robert Muir added a comment - I'd like to commit this soon if there are no objections. these apis (SpanPositionCheckQuery) are new in 3.x/trunk so theres no backwards break
        Hide
        Robert Muir added a comment -

        Committed revision 1035397, 1035411 (3x)

        Show
        Robert Muir added a comment - Committed revision 1035397, 1035411 (3x)
        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        Grant Ingersoll added a comment - Bulk close for 3.1

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development