[LUCENE-2410] Optimize PhraseQuery - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: core/search
Labels: None
Lucene Fields: New

Description

Looking at the scorers for PhraseQuery, I think there are some speedups
we could do:

The AND part of the scorer (which advances to the next doc that
has all the terms), in PhraseScorer.doNext, should apply the same
optimization as BooleanQuery's ConjunctionScorer, i.e. sort terms from
rarest to most frequent. I don't think it should use the linked
list/firstToLast() approach that it uses today.
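
A minimal sketch of the rarest-first idea, not Lucene's actual PhraseScorer or ConjunctionScorer code. It assumes each term's postings are simply a sorted int[] of docIDs; the class name RarestFirstConjunction and both methods are hypothetical. The shortest list drives the loop, so the more frequent terms are only probed for docs the rarest term already has.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; postings are modeled as sorted docID arrays. */
final class RarestFirstConjunction {

    /** Returns the docIDs present in every postings list (each sorted ascending). */
    static List<Integer> intersect(int[][] postings) {
        // Sort a shallow copy so the shortest (rarest) list comes first.
        int[][] sorted = postings.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] p) -> p.length));

        List<Integer> hits = new ArrayList<>();
        if (sorted.length == 0) {
            return hits;
        }
        int[] cursors = new int[sorted.length]; // per-list read position

        // The rarest term proposes candidates; the others only confirm them.
        for (int doc : sorted[0]) {
            boolean all = true;
            for (int i = 1; i < sorted.length && all; i++) {
                cursors[i] = advance(sorted[i], cursors[i], doc);
                all = cursors[i] < sorted[i].length && sorted[i][cursors[i]] == doc;
            }
            if (all) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Index of the first entry >= target, searching from index 'from'. */
    static int advance(int[] docs, int from, int target) {
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }
}
```

For example, intersecting {1, 3, 5, 7, 9, 11}, {3, 9} and {2, 3, 4, 9, 10} only iterates over the two docs of the rarest list.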

We do way too much work now when .score() is not called, because
we go and find all occurrences of the phrase in the doc. Instead,
we should stop after finding the first occurrence, and count
the rest only if .score() is called.
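
Roughly what "stop at the first occurrence" could look like, as a sketch only. It assumes an exact phrase and that each term's positions within one doc are available as a sorted int[]; LazyPhraseFreq, matches() and freq() are hypothetical names, not existing Lucene API.

```java
import java.util.Arrays;

/** Hypothetical sketch: counting phrase occurrences is deferred until freq() is called. */
final class LazyPhraseFreq {
    // positions[i] = sorted positions of the i-th phrase term in one doc (at least one term assumed)
    private final int[][] positions;
    private int nextLead;   // index into positions[0] where scanning resumes
    private int freq = -1;  // -1 means "not counted yet"
    private final boolean matched;

    LazyPhraseFreq(int[][] positions) {
        this.positions = positions;
        this.matched = findNext(); // stop as soon as the first occurrence is found
    }

    /** Enough to decide whether the doc is a hit; does not count occurrences. */
    boolean matches() {
        return matched;
    }

    /** Counts the remaining occurrences on first call; what a score() would need. */
    int freq() {
        if (freq < 0) {
            freq = matched ? 1 : 0;
            while (findNext()) {
                freq++;
            }
        }
        return freq;
    }

    /** Advances to the next start position where term i occurs at start + i for all i. */
    private boolean findNext() {
        int[] lead = positions[0];
        while (nextLead < lead.length) {
            int start = lead[nextLead++];
            boolean ok = true;
            for (int i = 1; i < positions.length && ok; i++) {
                ok = Arrays.binarySearch(positions[i], start + i) >= 0;
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

A caller that only collects hits would call matches() and never pay for the full scan; only a caller that needs the score would call freq().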

For the exact case, I think we can use two int arrays to find the
matches. The first array holds the count of how many times a term
in the phrase "matched" a phrase starting at that position. When
that count == the number of terms in the phrase, it's a match.
The 2nd is a "gen" array (holds the docID when that count was last
touched), to avoid clearing. I.e., when incrementing the count, if
the docID != gen, we reset count to 0. I think this'd be faster
than the priority queue (PQ) we now use. The downside is that if you have
immense docs (position gets very large) we'd need 2 immense arrays.
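
The two-array idea could be sketched as below. This is an illustration under assumptions, not a patch: it assumes a known upper bound on positions and that each term's positions in the current doc are handed in as int arrays; ExactPhraseCounter and phraseFreq are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical sketch of the count/gen arrays for exact phrase matching. */
final class ExactPhraseCounter {
    private final int[] counts; // counts[p] = phrase terms seen that fit a phrase starting at p
    private final int[] gens;   // gens[p] = docID for which counts[p] was last written
    private final int numTerms;

    ExactPhraseCounter(int maxPosition, int numTerms) {
        this.counts = new int[maxPosition + 1];
        this.gens = new int[maxPosition + 1];
        Arrays.fill(gens, -1); // -1 = never touched; docID 0 is a valid doc
        this.numTerms = numTerms;
    }

    /**
     * positions[i] holds the positions of the i-th phrase term in doc docID.
     * Returns how many exact occurrences of the phrase the doc contains.
     */
    int phraseFreq(int docID, int[][] positions) {
        int freq = 0;
        for (int i = 0; i < numTerms; i++) {
            for (int pos : positions[i]) {
                int start = pos - i; // where the phrase would begin if this term fits
                if (start < 0) {
                    continue;
                }
                if (gens[start] != docID) {
                    // Stale count from an earlier doc: reset lazily instead of clearing the array.
                    gens[start] = docID;
                    counts[start] = 0;
                }
                if (++counts[start] == numTerms) {
                    freq++; // every term of the phrase lined up at this start position
                }
            }
        }
        return freq;
    }
}
```

The same instance is reused across docs; the gen check is what makes that safe without an Arrays.fill per doc. The memory cost is two ints per possible position, which is the downside noted above.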

It'd be great to do LUCENE-1252 along with this, i.e. factor
PhraseScorer into two AND'd sub-scorers (LUCENE-1252 is open for
this). The first one should be ConjunctionScorer, and the 2nd one
checks the positions (i.e., either the exact or sloppy scorers). This
would mean if the PhraseQuery is AND'd w/ other clauses (or, a filter
is applied) we would save CPU by not checking the positions for a doc
unless all other AND'd clauses accepted the doc.
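
To make the saving concrete, here is a toy sketch of the split, with the doc-level conjunction, the other AND'd clauses and the positions check modeled as plain arrays and predicates. All names (TwoPhasePhraseMatch, search, positionChecks) are hypothetical; this is not how Lucene wires scorers together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch: run the costly positions check only after cheaper clauses accept a doc. */
final class TwoPhasePhraseMatch {
    int positionChecks = 0; // how many times the expensive check actually ran

    /**
     * conjunctionDocs: docs containing all phrase terms (output of the first sub-scorer).
     * otherClauses:    stands in for other AND'd clauses or a filter.
     * phraseAt:        stands in for the exact or sloppy positions check.
     */
    List<Integer> search(int[] conjunctionDocs, IntPredicate otherClauses, IntPredicate phraseAt) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : conjunctionDocs) {
            if (!otherClauses.test(doc)) {
                continue; // rejected cheaply; positions are never read for this doc
            }
            positionChecks++;
            if (phraseAt.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With six candidate docs and a filter that accepts half of them, the positions check runs three times instead of six.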

Attachments

LUCENE-2410.patch (29 kB), 17/Jun/10 16:17, Michael McCandless
LUCENE-2410.patch (26 kB), 17/Jun/10 10:38, Michael McCandless
LUCENE-2410.patch (24 kB), 17/Jun/10 09:48, Michael McCandless
LUCENE-2410.patch (16 kB), 16/Jun/10 18:38, Michael McCandless
LUCENE-2410_rewrite.patch (1 kB), 12/May/10 03:13, Robert Muir

People

Assignee: Unassigned

Reporter: Michael McCandless

Votes: 1

Watchers: 2

Dates

Created: 22/Apr/10 16:44

Updated: 28/Aug/22 12:25

Resolved: 24/Jun/10 10:39