LUCENE-3221 (Lucene - Core)

improve docvalues integration with scoring

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

    Description

      Currently, the flexscoring branch is limited by the fact that you can index at most a single byte per document for scoring within Similarity.

      I added a simple test showing how an application can index a per-document value (such as a boost) itself and then use it in scoring: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/TestDocValuesScoring.java

      However, I think we should generalize this mechanism (note: the class names can be changed to whatever makes sense).
      In Similarity, instead of byte computeNorm(FieldInvertState), I think we should have void computeNorm(StatsWriter, FieldInvertState).

      Then a Similarity can ask the StatsWriter for instance(s), where an instance is something like a (name, type, aggregates) triple.
      Name would be a simple name like "boost" that the sim later uses to retrieve this docvalue. Type would be something like int8/int32/varint/byte.
      Aggregates could at first be a boolean or whatever; I think at first we should allow for the sum to be written (e.g. to provide sum and average).
      This would support aggregate statistics such as 'total number of tokens in index' and 'average length'.

      So an example of the new computeNorm (or whatever we call it) would be:

        void computeNorm(StatsWriter writer, FieldInvertState state) {
          writer.getReference("length", INT32, Aggregates.YES).write(state.numTokens);
          writer.getReference("boost", FLOAT32, Aggregates.NO).write(state.boost);
          ...
        }
      

      I think the sim should be able to refer to the docvalues fields it writes by "relative" names such as length and boost.
      How those names map to actual fields behind the scenes is an implementation detail.
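
      To make the proposal a bit more concrete, here is a minimal in-memory sketch of how the pieces might fit together. Everything in it is an assumption for illustration: StatsWriter, Reference, Type and Aggregates do not exist in Lucene, FieldState is a stand-in for FieldInvertState, and values are kept as doubles in a list rather than written to docvalues. It only shows the "relative name" lookup and the sum/average aggregate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed API; none of these types exist in Lucene.
class StatsWriterSketch {

    enum Type { INT8, INT32, VARINT, FLOAT32 }

    enum Aggregates { YES, NO }

    /** One per-document statistic, identified by its relative name in StatsWriter. */
    static final class Reference {
        final Type type;
        final Aggregates aggregates;
        final List<Double> values = new ArrayList<>(); // one entry per document, in doc order
        double sum;                                    // maintained only when aggregates == YES

        Reference(Type type, Aggregates aggregates) {
            this.type = type;
            this.aggregates = aggregates;
        }

        void write(double value) {
            values.add(value);
            if (aggregates == Aggregates.YES) {
                sum += value;
            }
        }

        double average() {
            if (aggregates != Aggregates.YES) {
                throw new IllegalStateException("aggregates were not requested");
            }
            return values.isEmpty() ? 0.0 : sum / values.size();
        }
    }

    /** Hands out references by relative name; real field naming stays hidden here. */
    static final class StatsWriter {
        private final Map<String, Reference> refs = new HashMap<>();

        Reference getReference(String name, Type type, Aggregates aggregates) {
            Reference ref = refs.computeIfAbsent(name, n -> new Reference(type, aggregates));
            if (ref.type != type || ref.aggregates != aggregates) {
                throw new IllegalArgumentException("conflicting definition for " + name);
            }
            return ref;
        }
    }

    /** Stand-in for FieldInvertState, carrying only what this sketch needs. */
    static final class FieldState {
        final int numTokens;
        final float boost;

        FieldState(int numTokens, float boost) {
            this.numTokens = numTokens;
            this.boost = boost;
        }
    }

    /** Mirrors the computeNorm example from the description. */
    static void computeNorm(StatsWriter writer, FieldState state) {
        writer.getReference("length", Type.INT32, Aggregates.YES).write(state.numTokens);
        writer.getReference("boost", Type.FLOAT32, Aggregates.NO).write(state.boost);
    }

    public static void main(String[] args) {
        StatsWriter writer = new StatsWriter();
        computeNorm(writer, new FieldState(10, 1.0f)); // document 0
        computeNorm(writer, new FieldState(30, 2.0f)); // document 1

        Reference length = writer.getReference("length", Type.INT32, Aggregates.YES);
        System.out.println("total tokens   = " + length.sum);
        System.out.println("average length = " + length.average());
    }
}
```

      With the sum recorded per reference, 'total number of tokens in index' is the sum itself and 'average length' is the sum divided by the number of documents written, so a single boolean-style aggregates flag would be enough for both.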

      Also, for this to work, I think we need to add int8, int16, int32, ... types to docvalues, and maybe we should add hasArray()/getArray(). I think
      the existing compressed INTS should be kept, but maybe renamed to varint or something like that. It could still be useful: for example, if someone
      wants to have "real document lengths" for BM25 but doesn't really need the full 32-bit range, they can make the tradeoff to use packed integers
      and load less into RAM. But that should be the sim's choice to make.
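
      As a rough illustration of that tradeoff, the arithmetic below compares a fixed int32 per document with a packed layout that uses only as many bits as the largest value needs. This is plain arithmetic under assumed numbers (one million documents, lengths up to 1000 tokens), not Lucene's packed-ints implementation, and the class and method names are made up for the example.

```java
// Hypothetical back-of-the-envelope helper; not part of Lucene.
class PackedBitsSketch {

    /** Number of bits needed to represent any value in [0, maxValue]. */
    static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be non-negative");
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /** Bytes needed when every document uses exactly bitsRequired(maxValue) bits. */
    static long packedBytes(long docCount, long maxValue) {
        long totalBits = docCount * bitsRequired(maxValue);
        return (totalBits + 7) / 8; // round up to whole bytes
    }

    /** Bytes needed when every document uses a full 32-bit int. */
    static long int32Bytes(long docCount) {
        return docCount * 4;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L; // assumed index size
        long maxLength = 1000L;     // assumed longest document, in tokens

        System.out.println("bits per value: " + bitsRequired(maxLength));
        System.out.println("packed bytes:   " + packedBytes(docCount, maxLength));
        System.out.println("int32 bytes:    " + int32Bytes(docCount));
    }
}
```

      Under these assumptions each length fits in 10 bits, so the packed form needs well under half the memory of a plain int32 array, at the cost of bit-twiddling on every read.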

People

    • Assignee: Unassigned
    • Reporter: Robert Muir (rcmuir)