Lucene - Core / LUCENE-3290

add FieldInvertState.numUniqueTerms, Terms.sumDocFreq

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      For scoring systems like lnu.ltc (http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf), we need to supply three statistics:

      • average tf within document d
      • number of unique terms within d
      • average number of unique terms per document across the field

      If we add FieldInvertState.numUniqueTerms, you can incorporate the first two into your norms/docvalues (once we cut over),
      the average tf within d being length / numUniqueTerms.

      To compute the average across the field, we can write the sum of all terms' docFreqs (Terms.sumDocFreq) into the terms dictionary header,
      and you can then divide this by maxDoc to get the average.
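
As a rough illustration of the arithmetic described above, here is a minimal standalone sketch in plain Java. It does not use the Lucene API: the class FieldStatsSketch, the nested DocStats type, and all method names are hypothetical and exist only for this example. It assumes each document is already tokenized into a list of terms, and that maxDoc is simply the number of documents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Lucene classes.
class FieldStatsSketch {

    /** Per-document values: total token count and number of distinct terms. */
    static final class DocStats {
        final int length;
        final int numUniqueTerms;

        DocStats(int length, int numUniqueTerms) {
            this.length = length;
            this.numUniqueTerms = numUniqueTerms;
        }

        /** Average tf within the document: length / numUniqueTerms. */
        double averageTf() {
            return numUniqueTerms == 0 ? 0.0 : (double) length / numUniqueTerms;
        }
    }

    /** Counts tokens and distinct terms for one document's field. */
    static DocStats invert(List<String> tokens) {
        return new DocStats(tokens.size(), new HashSet<>(tokens).size());
    }

    /** Sum over all terms of the number of documents containing the term. */
    static long sumDocFreq(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Each document contributes at most 1 to a term's docFreq.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        long sum = 0;
        for (int df : docFreq.values()) {
            sum += df;
        }
        return sum;
    }

    /** Average number of unique terms per document: sumDocFreq / maxDoc. */
    static double averageUniqueTerms(long sumDocFreq, int maxDoc) {
        return maxDoc == 0 ? 0.0 : (double) sumDocFreq / maxDoc;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("a", "b", "a", "c"),
            Arrays.asList("b", "b"));
        DocStats first = invert(docs.get(0));
        System.out.println("avg tf in first doc: " + first.averageTf());
        long sum = sumDocFreq(docs);
        System.out.println("sumDocFreq: " + sum);
        System.out.println("avg unique terms: " + averageUniqueTerms(sum, docs.size()));
    }
}
```

The sketch relies on the fact that summing docFreq over all terms gives the same total as summing the number of unique terms over all documents, which is why dividing by maxDoc yields the per-document average.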

        Attachments

        1. LUCENE-3290.patch
          28 kB
          Robert Muir
        2. LUCENE-3290.patch
          28 kB
          Robert Muir

            People

            • Assignee:
              rcmuir Robert Muir
              Reporter:
              rcmuir Robert Muir
