Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Looking the scorers for PhraseQuery, I think there are some speedups
      we could do:

      • The AND part of the scorer (which advances to the next doc that
        has all the terms), in PhraseScorer.doNext, should do the same
        optimizing as BooleanQuery's ConjunctionScorer, ie sort terms from
        rarest to most frequent. I don't think it should use a linked
        list/firstToLast() that it does today.
      • We do way too much work now when .score() is not called, because
        we go and find all occurrences of the phrase in the doc, whereas
        we should stop only after finding the first and then go and count
        the rest if .score() is called.
      • For the exact case, I think we can use two int arrays to find the
        matches. The first array holds the count of how many times a term
        in the phrase "matched" a phrase starting at that position. When
        that count == the number of terms in the phrase, it's a match.
        The 2nd is a "gen" array (holds docID when that count was last
        touched), to avoid clearing. Ie when incrementing the count, if
        the docID != gen, we reset count to 0. I think this'd be faster
        than the PQ we now use. Downside of this is if you have immense
        docs (position gets very large) we'd need 2 immense arrays.

      It'd be great to do LUCENE-1252 along with this, ie factor
      PhraseScorer into two AND'd sub-scorers (LUCENE-1252 is open for
      this). The first one should be ConjunctionScorer, and the 2nd one
      checks the positions (ie, either the exact or sloppy scorers). This
      would mean if the PhraseQuery is AND'd w/ other clauses (or, a filter
      is applied) we would save CPU by not checking the positions for a doc
      unless all other AND'd clauses accepted the doc.

      1. LUCENE-2410_rewrite.patch
        1 kB
        Robert Muir
      2. LUCENE-2410.patch
        16 kB
        Michael McCandless
      3. LUCENE-2410.patch
        24 kB
        Michael McCandless
      4. LUCENE-2410.patch
        26 kB
        Michael McCandless
      5. LUCENE-2410.patch
        29 kB
        Michael McCandless

        Activity

        No work has yet been logged on this issue.

          People

          • Assignee:
            Unassigned
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development