Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-2760

optimize spanfirstquery, spanpositionrangequery

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      SpanFirstQuery and SpanPositionRangeQuery (SpanFirst is just a special case of this), are currently inefficient.

      Take this worst case example: SpanFirstQuery("the").
      Currently the code reads all the positions for the term "the".

      But when enumerating spans, once we have passed the allowable range we should move on to the next document (skipTo)

      1. LUCENE-2760.patch
        8 kB
        Robert Muir
      2. LUCENE-2760.patch
        7 kB
        Robert Muir

        Activity

        Hide
        gsingers Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        gsingers Grant Ingersoll added a comment - Bulk close for 3.1
        Hide
        rcmuir Robert Muir added a comment -

        Committed revision 1035397, 1035411 (3x)

        Show
        rcmuir Robert Muir added a comment - Committed revision 1035397, 1035411 (3x)
        Hide
        rcmuir Robert Muir added a comment -

        I'd like to commit this soon if there are no objections.

        these apis (SpanPositionCheckQuery) are new in 3.x/trunk so theres no backwards break

        Show
        rcmuir Robert Muir added a comment - I'd like to commit this soon if there are no objections. these apis (SpanPositionCheckQuery) are new in 3.x/trunk so theres no backwards break
        Hide
        rcmuir Robert Muir added a comment -

        here's an updated patch, with javadocs.

        additionally i now check for spans.start() >= end() instead of spans.start() > end()

        i believe its invalid to have a zero-length span (e.g. for a single term end = start+1)
        I added an assert to check for this, and all tests still pass.

        Show
        rcmuir Robert Muir added a comment - here's an updated patch, with javadocs. additionally i now check for spans.start() >= end() instead of spans.start() > end() i believe its invalid to have a zero-length span (e.g. for a single term end = start+1) I added an assert to check for this, and all tests still pass.
        Hide
        rcmuir Robert Muir added a comment -

        Admittedly, I don't yet have a good benchmarking setup for these spanqueries yet.

        But from doing a quick test on a 125k doc corpus, the SpanFirstQuery on a common term like "the" took
        about half the time.. this is because it read/evaluated 117,556 positions instead of 1,029,622 positions.

        Show
        rcmuir Robert Muir added a comment - Admittedly, I don't yet have a good benchmarking setup for these spanqueries yet. But from doing a quick test on a 125k doc corpus, the SpanFirstQuery on a common term like "the" took about half the time.. this is because it read/evaluated 117,556 positions instead of 1,029,622 positions.
        Hide
        rcmuir Robert Muir added a comment -

        here's the patch, the SpanPositionCheckQuery now has logic similar to FilteredTermsEnum,
        instead of returning a boolean true/false for whether a match is acceptable,
        it can return YES, NO, NO_AND_ADVANCE

        Show
        rcmuir Robert Muir added a comment - here's the patch, the SpanPositionCheckQuery now has logic similar to FilteredTermsEnum, instead of returning a boolean true/false for whether a match is acceptable, it can return YES, NO, NO_AND_ADVANCE

          People

          • Assignee:
            rcmuir Robert Muir
            Reporter:
            rcmuir Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development