Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Looking at the scorers for PhraseQuery, I think there are some
      speedups we could make:

      • The AND part of the scorer (which advances to the next doc that
        has all the terms), in PhraseScorer.doNext, should do the same
        optimization as BooleanQuery's ConjunctionScorer, i.e. sort terms
        from rarest to most frequent, instead of the linked
        list/firstToLast() approach it uses today.
      • We currently do far too much work when .score() is not called:
        we find all occurrences of the phrase in the doc, when we should
        stop after finding the first occurrence and only count the rest
        if .score() is called.
      • For the exact case, I think we can use two int arrays to find the
        matches. The first array holds the count of how many terms in the
        phrase "matched" a phrase starting at that position; when that
        count == the number of terms in the phrase, it's a match. The 2nd
        is a "gen" array (holding the docID when that count was last
        touched), to avoid clearing: when incrementing the count, if the
        docID != gen, we reset the count to 0. I think this'd be faster
        than the PQ we now use. The downside is that for immense docs
        (where position gets very large) we'd need 2 immense arrays.
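The count/gen idea in the last bullet can be sketched as a toy standalone version (position arrays stand in for each term's postings; this is illustrative, not Lucene's actual ExactPhraseScorer code):

```java
// Toy sketch of the count/gen scheme: a position p of term i in the phrase
// "votes" for a phrase starting at p - i.  When a start position has
// collected one vote per term, the phrase matches there.  The gen array
// records which docID last touched a slot, so counts never need a full
// clear between docs.
public class ExactPhraseSketch {
  static int countMatches(int[][] positionsPerTerm, int docID,
                          int[] counts, int[] gens) {
    int numTerms = positionsPerTerm.length;
    int matches = 0;
    for (int i = 0; i < numTerms; i++) {
      for (int pos : positionsPerTerm[i]) {
        int start = pos - i;             // candidate phrase start position
        if (start < 0) continue;
        if (gens[start] != docID) {      // slot is stale from an earlier doc
          gens[start] = docID;
          counts[start] = 0;             // lazy reset instead of a full clear
        }
        if (++counts[start] == numTerms) {
          matches++;                     // every term lined up at this start
        }
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // "united" at positions {0, 7}, "states" at positions {1, 5} in doc 42:
    // only start 0 collects a vote from both terms.
    int[] counts = new int[16];
    int[] gens = new int[16];
    java.util.Arrays.fill(gens, -1);
    System.out.println(
        countMatches(new int[][] { {0, 7}, {1, 5} }, 42, counts, gens)); // 1
  }
}
```

Since each term votes for a given start at most once, a count equal to the number of terms means every term was present at its expected offset.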

      It'd be great to do LUCENE-1252 along with this, i.e. factor
      PhraseScorer into two AND'd sub-scorers (LUCENE-1252 is open for
      this). The first would be ConjunctionScorer, and the second would
      check the positions (i.e. either the exact or sloppy scorer). This
      would mean that if the PhraseQuery is AND'd w/ other clauses (or a
      filter is applied), we would save CPU by not checking the positions
      for a doc unless all other AND'd clauses accepted it.
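The two-scorer factoring can be sketched like this (hypothetical interface names; a minimal illustration of the lazy positions check, not LUCENE-1252's actual design):

```java
import java.util.Iterator;
import java.util.List;

// Hedged sketch: a conjunction step supplies candidate docs that contain all
// the terms, and the (expensive) positions check runs only on those
// candidates, so a doc rejected by other AND'd clauses or a filter never
// pays the cost of reading positions.
public class TwoPhasePhraseSketch {
  interface PositionsCheck {
    boolean phraseMatchesIn(int docID);  // the exact or sloppy positions test
  }

  static int nextMatch(Iterator<Integer> conjunctionDocs, PositionsCheck check) {
    while (conjunctionDocs.hasNext()) {
      int doc = conjunctionDocs.next();
      if (check.phraseMatchesIn(doc)) {
        return doc;                      // positions confirmed the phrase
      }
    }
    return -1;                           // exhausted: no more matches
  }

  public static void main(String[] args) {
    // Docs 3, 8, 12 contain all the phrase terms, but only doc 8 has them
    // in phrase order:
    Iterator<Integer> candidates = List.of(3, 8, 12).iterator();
    System.out.println(nextMatch(candidates, doc -> doc == 8)); // 8
  }
}
```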

      1. LUCENE-2410_rewrite.patch
        1 kB
        Robert Muir
      2. LUCENE-2410.patch
        29 kB
        Michael McCandless
      3. LUCENE-2410.patch
        26 kB
        Michael McCandless
      4. LUCENE-2410.patch
        24 kB
        Michael McCandless
      5. LUCENE-2410.patch
        16 kB
        Michael McCandless

        Activity

        Michael McCandless created issue -
        Michael McCandless added a comment -

        Another thing we should fix – PhraseQuery of a single term should rewrite to TermQuery.
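The idea amounts to something like this toy sketch (string placeholders stand in for real Query objects; not the actual patch):

```java
// Toy model of the degenerate-phrase rewrite: a one-term PhraseQuery matches
// exactly the same docs as a TermQuery on that term, so rewrite() can simply
// return the cheaper query.
public class RewriteSketch {
  static String rewrite(String[] phraseTerms) {
    return phraseTerms.length == 1
        ? "TermQuery(" + phraseTerms[0] + ")"
        : "PhraseQuery(" + String.join(" ", phraseTerms) + ")";
  }

  public static void main(String[] args) {
    System.out.println(rewrite(new String[] {"lucene"}));           // TermQuery(lucene)
    System.out.println(rewrite(new String[] {"united", "states"})); // PhraseQuery(united states)
  }
}
```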

        Robert Muir added a comment -

        Just doing the easy part here; here's the rewrite patch.
        I checked: MultiPhraseQuery already has it.

        Robert Muir made changes -
        Field Original Value New Value
        Attachment LUCENE-2410_rewrite.patch [ 12444271 ]
        Michael McCandless added a comment -

        Looks great Robert – I think you should go ahead & commit that and we'll work on the rest of these optos later.

        Robert Muir added a comment -

        Committed revisions 943493 (trunk), 943499 (3x)

        Michael McCandless added a comment -

        Attached initial rough patch, doing the 1st and 3rd bullets above.
        Still many nocommits, but all tests pass.

        I only did this for the exact case (I don't understand the sloppy
        case!), so I modified ExactPhraseScorer to no longer subclass
        PhraseScorer and instead do everything on its own.

        I tested on a 20M doc Wikipedia index, best of 10 runs:

        Query                      | No. hits | Trunk QPS | Patch QPS | Speedup
        United States              | 314K     | 4.29      | 11.04     | 2.6X faster
        United Kingdom Parliament  | 7K       | 20.33     | 58.57     | 2.9X faster

        The speedup is great.

        However, there's one problem w/ the patch that I must fix (and it
        will bring these gains down): it requires 2 int arrays sized to
        the max position encountered during the search (which for a large
        doc could be very large). I think to make this committable I'd
        have to switch to processing the positions in chunks (like
        BooleanScorer).

        Michael McCandless made changes -
        Attachment LUCENE-2410.patch [ 12447251 ]
        Michael McCandless added a comment -

        New patch attached, switches over to chunking (processing the positions 4096 at once).

        United States is now 10.66 QPS (2.5X speedup) and United Kingdom Parliament is now 54.93 QPS (2.7X speedup).

        I think it's ready to commit... I'll wait a few days.
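The chunking idea can be sketched like this (a toy version for illustration; the real patch also avoids the per-chunk clear by using a gen array):

```java
// Hedged sketch of chunked counting: instead of count/gen arrays sized to the
// largest position in the doc, process candidate phrase-start positions in
// fixed-size windows (the patch uses 4096).  A position p of term i votes for
// start p - i; per chunk we only tally starts inside the current window, so
// the arrays stay small regardless of doc length.
public class ChunkedPhraseSketch {
  static final int CHUNK = 4096;

  static int countMatches(int[][] positionsPerTerm) {
    int numTerms = positionsPerTerm.length;
    int maxPos = 0;
    for (int[] ps : positionsPerTerm)
      for (int p : ps) maxPos = Math.max(maxPos, p);

    int[] counts = new int[CHUNK];     // fixed size, independent of doc length
    int matches = 0;
    for (int chunkStart = 0; chunkStart <= maxPos; chunkStart += CHUNK) {
      java.util.Arrays.fill(counts, 0); // (the patch avoids this w/ a gen array)
      int chunkEnd = chunkStart + CHUNK;
      for (int i = 0; i < numTerms; i++) {
        for (int pos : positionsPerTerm[i]) {
          int start = pos - i;          // candidate phrase start position
          if (start >= chunkStart && start < chunkEnd) {
            if (++counts[start - chunkStart] == numTerms) {
              matches++;                // every term lined up at this start
            }
          }
        }
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // "united" at {0, 5000}, "states" at {1, 5001}: matches at starts 0 and
    // 5000, which fall in different 4096-position chunks.
    System.out.println(countMatches(new int[][] { {0, 5000}, {1, 5001} })); // 2
  }
}
```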

        Michael McCandless made changes -
        Attachment LUCENE-2410.patch [ 12447328 ]
        Michael McCandless made changes -
        Fix Version/s 3.1 [ 12314822 ]
        Michael McCandless added a comment -

        Attached patch w/ another optimization: if the freq of the 2 rarest terms in the phrase are "closish", then just use .nextDoc() instead of .advance() when ANDing. This buys another 15% speedup (12.30 QPS, net 2.9X faster than trunk) on United States phrase query.

        Also, I fixed MultiPhraseQuery to sort its clauses by approx docFreq; the optimization is approx in this case because we can't efficiently compute the docFreq of a position that's unioning > 1 term.
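The heuristic can be sketched roughly as follows (the "closish" threshold here is an assumption for illustration, not the patch's actual cutoff):

```java
// Hedged sketch of the useAdvance heuristic: when the two rarest terms have
// similar document frequencies, their postings interleave densely, so simply
// stepping with .nextDoc() is cheaper than .advance() (which may do extra
// work per skip); when one term is much rarer, .advance() skips large
// stretches of the denser postings and wins.
public class AdvanceHeuristicSketch {
  static boolean useNextDoc(int rarestDocFreq, int secondRarestDocFreq) {
    // "Closish": second-rarest term is at most ~3x as frequent (assumed ratio).
    return secondRarestDocFreq <= 3L * rarestDocFreq;
  }

  public static void main(String[] args) {
    System.out.println(useNextDoc(100_000, 150_000)); // true  -> just nextDoc()
    System.out.println(useNextDoc(100, 500_000));     // false -> advance()
  }
}
```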

        Michael McCandless made changes -
        Attachment LUCENE-2410.patch [ 12447334 ]
        Michael McCandless added a comment -

        New patch – makes the "useAdvance" per-Term, and, adds a safety fallback to .advance if too many .nextDocs are used.

        Michael McCandless made changes -
        Attachment LUCENE-2410.patch [ 12447357 ]
        Yonik Seeley added a comment -

        Fantastic! Phrase queries have often been a bottleneck.

        Michael Busch added a comment -

        Very nice, Mike!

        Another improvement we could make for positional queries (phrases, span queries) would be skip lists on the positions, maybe in a different codec? This would probably be a nice speedup for large docs.

        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Michael McCandless added a comment -

        Alas.... I think I somehow screwed up my performance tests above.

        I'm testing search perf (working on LUCENE-2504), and in comparing search perf from 2.9.x -> 3.x, I only saw a ~20% speedup on the phrase query "united states", for a 5M doc Wikipedia index. And, re-running the test on trunk pre and post this commit, I still see only ~20% gain.... still not sure what I did wrong.

        I'll update CHANGES. Two steps forward, one step back... sigh.

        Mark Thomas made changes -
        Workflow jira [ 12509120 ] Default workflow, editable Closed status [ 12564301 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12564301 ] jira [ 12584836 ]
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Grant Ingersoll made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Michael McCandless
          • Votes:
            1
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:
