[LUCENE-3221] improve docvalues integration with scoring - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 6.0
Component/s: core/index
Labels: None
Lucene Fields: New

Description

Currently, the flexscoring branch is limited by the fact that you can index at most a single byte per document for scoring within Similarity.

I added a simple test showing how an application can index a per-document value (such as a boost) itself and then use it in scoring: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/TestDocValuesScoring.java

However, I think we should generalize this mechanism (note: the class names can be changed to whatever makes sense).
In Similarity, instead of byte computeNorm(FieldInvertState), I think we should have void computeNorm(StatsWriter, FieldInvertState).

Then a Similarity can ask the StatsWriter for instance(s), where an instance is something like a (name, type, aggregates) triple.
Name would be a simple name like "boost" that the sim later uses to retrieve this docvalue. Type would be something like int8/int32/varint/byte.
Aggregates could at first be a boolean or whatever; I think at first we should allow for the sum to be written (e.g. to provide sum and average).
This would support aggregate statistics such as 'total number of tokens in index' and 'average length'.

So an example of the new computeNorm (or whatever we call it) would be:

  void computeNorm(StatsWriter writer, FieldInvertState state) {
    writer.getReference("length", INT32, Aggregates.YES).write(state.numTokens);
    writer.getReference("boost", FLOAT32, Aggregates.NO).write(state.boost);
    ...
  }
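To make the shape of this proposal concrete, here is a minimal in-memory sketch of how such a writer could behave. Everything in it is an assumption for illustration: StatsWriterSketch, Reference, Type, Aggregates and FieldState are hypothetical names (FieldState stands in for the two FieldInvertState values used above), none of them exist in Lucene, and the type is only recorded, not used to choose an encoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposal. None of these types exist in Lucene.
final class StatsWriterSketch {

  enum Type { INT8, INT16, INT32, VARINT, FLOAT32 }

  enum Aggregates { YES, NO }

  /** Stand-in for the two FieldInvertState values used in the example above. */
  static final class FieldState {
    final int numTokens;
    final float boost;

    FieldState(int numTokens, float boost) {
      this.numTokens = numTokens;
      this.boost = boost;
    }
  }

  /** One named per-document statistic, optionally tracking a running sum. */
  static final class Reference {
    final String name;
    final Type type; // recorded only; this sketch does not pick an encoding from it
    final Aggregates aggregates;
    private final List<Double> perDoc = new ArrayList<>();
    private double sum;

    Reference(String name, Type type, Aggregates aggregates) {
      this.name = name;
      this.type = type;
      this.aggregates = aggregates;
    }

    /** Records the value for the next document, in document order. */
    void write(double value) {
      perDoc.add(value);
      if (aggregates == Aggregates.YES) {
        sum += value;
      }
    }

    double get(int docId) {
      return perDoc.get(docId);
    }

    int docCount() {
      return perDoc.size();
    }

    double sum() {
      if (aggregates == Aggregates.NO) {
        throw new IllegalStateException("no aggregates recorded for " + name);
      }
      return sum;
    }

    double average() {
      double total = sum();
      return perDoc.isEmpty() ? 0.0 : total / perDoc.size();
    }
  }

  /** Hands out references by "relative" name, creating them on first use. */
  static final class StatsWriter {
    private final Map<String, Reference> refs = new LinkedHashMap<>();

    Reference getReference(String name, Type type, Aggregates aggregates) {
      Reference ref = refs.get(name);
      if (ref == null) {
        ref = new Reference(name, type, aggregates);
        refs.put(name, ref);
      } else if (ref.type != type || ref.aggregates != aggregates) {
        throw new IllegalArgumentException("conflicting definition for " + name);
      }
      return ref;
    }
  }

  /** Mirrors the computeNorm example from the description. */
  static void computeNorm(StatsWriter writer, FieldState state) {
    writer.getReference("length", Type.INT32, Aggregates.YES).write(state.numTokens);
    writer.getReference("boost", Type.FLOAT32, Aggregates.NO).write(state.boost);
  }

  public static void main(String[] args) {
    StatsWriter writer = new StatsWriter();
    computeNorm(writer, new FieldState(3, 1.0f));
    computeNorm(writer, new FieldState(5, 2.0f));
    Reference length = writer.getReference("length", Type.INT32, Aggregates.YES);
    System.out.println("total tokens: " + length.sum());
    System.out.println("average length: " + length.average());
  }
}
```

In this sketch the sum is the only aggregate kept, and the average is derived from it and the document count, which matches the 'total number of tokens in index' and 'average length' statistics mentioned above.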

As for the docvalues field names that the Sim writes, I think the Sim should be able to reference them with "relative" names like length and boost.
Whatever we do behind the scenes is an implementation detail.

Also, for this to work, I think we need to add int8, int16, int32, ... types to docvalues, and maybe we should add hasArray()/getArray(). I think the existing compressed INTS should be kept, but maybe renamed to varint or something like that. This could still be useful: for example, if someone wants "real document lengths" for BM25 but doesn't really need the full 32-bit range, they can make the tradeoff to use packed integers and load less into RAM. But that should be the sim's choice to make.
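The packed-versus-fixed-width tradeoff can be estimated with a little arithmetic. The sketch below is only an illustration under stated assumptions: PackedWidthSketch and its methods are hypothetical helpers, not Lucene classes, and the estimate ignores headers and alignment that a real encoding would add.

```java
// Hypothetical helpers for estimating storage; not part of Lucene.
final class PackedWidthSketch {

  /** Minimum number of bits needed to store any value in [0, maxValue]. */
  static int bitsRequired(long maxValue) {
    if (maxValue < 0) {
      throw new IllegalArgumentException("maxValue must be non-negative: " + maxValue);
    }
    // 64 minus the leading zero count is the position of the highest set bit;
    // a maximum of 0 still needs one bit per value.
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  /** Bytes needed for docCount values at bitsPerValue bits each, rounded up. */
  static long packedBytes(long docCount, int bitsPerValue) {
    return (docCount * bitsPerValue + 7) / 8;
  }

  /** Bytes needed for docCount values stored as fixed 32-bit integers. */
  static long fixedInt32Bytes(long docCount) {
    return docCount * 4L;
  }

  public static void main(String[] args) {
    long docCount = 1_000_000L;
    long maxLength = 4095L; // assumed longest document length, in tokens
    int bits = bitsRequired(maxLength);
    System.out.println("bits per value: " + bits);
    System.out.println("packed bytes:   " + packedBytes(docCount, bits));
    System.out.println("int32 bytes:    " + fixedInt32Bytes(docCount));
  }
}
```

With the assumed figures (one million documents, lengths up to 4095 tokens), packed storage needs 12 bits per value, about 1.5 MB against 4 MB for int32. The cost is that packed values generally cannot be exposed as a plain array, which is why the choice is best left to the sim.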

Issue Links

depends upon: LUCENE-3231 Add fixed size DocValues int variants & expose Arrays where possible (Closed)

People

Assignee: Unassigned
Reporter: Robert Muir
Votes: 0
Watchers: 0

Dates

Created: 20/Jun/11 14:14
Updated: 28/Aug/22 12:50