[LUCENE-10236] CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.1
Component/s: modules/sandbox
Labels: None
Lucene Fields: New

Description

This is a spin-off issue from discussion in https://github.com/apache/lucene/pull/418#issuecomment-967790816, for a quick fix in CombinedFieldsQuery scoring.

Currently, CombinedFieldsQuery uses a locally constructed fields object to create the MultiNormsLeafSimScorer used for scoring. Because that object is built by looping over fieldTerms, it may contain duplicated field-weight pairs, which results in duplicated norms being added during the scoring calculation in MultiNormsLeafSimScorer.

E.g. for a CombinedFieldsQuery with two fields and two terms, all matching a particular doc:

CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", 1.0f)
        .addField("field2", 1.0f)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();

I would expect the scoring to be based on the following:

Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
Sum of norms on doc = norm(field1) + norm(field2)

but the current logic uses the following for scoring:

Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2)

In addition, this differs from how MultiNormsLeafSimScorer is constructed in CombinedFieldsQuery's explain function, which uses fieldAndWeights.values() and therefore does not contain duplicated field-weight pairs.
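
To make the difference concrete, here is a minimal, self-contained sketch of the two ways of collecting field-weight pairs. It does not use Lucene: the FieldAndWeight class, the helper methods, the norm values (10 and 20) and the "sum of weight * norm" rule are all assumptions made for illustration, not the actual MultiNormsLeafSimScorer code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NormSumSketch {
    // Hypothetical stand-in for a (field, weight) pair; not the Lucene class.
    static final class FieldAndWeight {
        final String field;
        final float weight;

        FieldAndWeight(String field, float weight) {
            this.field = field;
            this.weight = weight;
        }
    }

    // Mimics building the list while looping over every (field, term) pair:
    // each field is appended once per term, so pairs repeat.
    static List<FieldAndWeight> fieldsPerFieldTerm(
            Map<String, FieldAndWeight> fieldAndWeights, List<String> terms) {
        List<FieldAndWeight> fields = new ArrayList<>();
        for (FieldAndWeight fw : fieldAndWeights.values()) {
            for (String term : terms) {
                fields.add(fw); // one entry per (field, term) combination
            }
        }
        return fields;
    }

    // Assumed combination rule for this sketch: sum of weight * norm per entry.
    static float sumNorms(Iterable<FieldAndWeight> fields, Map<String, Float> norms) {
        float sum = 0f;
        for (FieldAndWeight fw : fields) {
            sum += fw.weight * norms.get(fw.field);
        }
        return sum;
    }

    // Two fields with weight 1.0, as in the example query above.
    static Map<String, FieldAndWeight> exampleFields() {
        Map<String, FieldAndWeight> m = new LinkedHashMap<>();
        m.put("field1", new FieldAndWeight("field1", 1.0f));
        m.put("field2", new FieldAndWeight("field2", 1.0f));
        return m;
    }

    // Made-up per-document norm values, chosen only to make the sums easy to read.
    static Map<String, Float> exampleNorms() {
        Map<String, Float> m = new LinkedHashMap<>();
        m.put("field1", 10f);
        m.put("field2", 20f);
        return m;
    }

    public static void main(String[] args) {
        Map<String, FieldAndWeight> fieldAndWeights = exampleFields();
        Map<String, Float> norms = exampleNorms();
        List<String> terms = Arrays.asList("foo", "zoo");

        float duplicated = sumNorms(fieldsPerFieldTerm(fieldAndWeights, terms), norms);
        float deduplicated = sumNorms(fieldAndWeights.values(), norms);
        System.out.println("per field-term: " + duplicated);   // 60.0
        System.out.println("values():       " + deduplicated); // 30.0
    }
}
```

With these made-up values, the per-field-term list counts each norm twice (60.0), while iterating fieldAndWeights.values() counts each field once (30.0), which matches the expected "norm(field1) + norm(field2)".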