[LUCENE-9399] Wrong cost calculation in prefix/wildcard queries - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 8.5.2
Fix Version/s: None
Component/s: core/other
Labels:
None

Lucene Fields:

New

Description

When running a prefix query that matches more than 16 terms a BitSet is built for that subquery with the DocIdSetBuilder.
The cost of the subquery calculated in that class too.
If I get it right, the idea behind the calculation is to get the frequency of every matched term, sum it all and then divide by the average number of terms per doc. It makes sense but it seems like the sum is not calculated properly:

public void add(DocIdSetIterator iter) throws IOException {
 if (bitSet != null) {
 bitSet.or(iter);
 return;
 }
 int cost = (int) Math.min(Integer.MAX_VALUE, iter.cost());
 BulkAdder adder = grow(cost);

Instead of adding the frequency of every term, it stops after the `grow()` creates a bitset. From that moment add() will not call grow and the "counter" variable that stands for the sum of all frequencies will not grow anymore.

In the end, it can create a prefix query with a full bitset (or almost full) that has a low cost. That query can become a leader in a conjunction query.

It's important to say that even if the function is fixed as mentioned above, the cost might still result in a much lower cost than it should.
For example, If we look for all the terms which start with "strin*" in a java_code field. String term and 20 more terms will match that query. The String Term will match most of the documents but the cost will be pretty low because of the calculation (sum_of_frequencies / avarage_number_of_terms_per_doc).

Maybe we can count every new bit added to the bitset.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Nir Finkelstein

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 09/Jun/20 16:34

Updated:: 28/Aug/22 16:04