Lucene - Core
LUCENE-3290

add FieldInvertState.numUniqueTerms, Terms.sumDocFreq

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      For scoring systems like lnu.ltc (http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf), we need to supply 3 stats:

      • average tf within d
      • # of unique terms within d
      • average number of unique terms across field

      If we add FieldInvertState.numUniqueTerms, you can incorporate the first two into your norms/docvalues (once we cut over),
      the average tf within d being length / numUniqueTerms.

      To compute the average across the field, we can just write the sum of all terms' docfreqs into the terms dictionary header,
      and you can then divide this by maxdoc to get the average.
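
      The arithmetic above can be sketched as follows. Each document adds 1 to the docFreq of every unique term it contains, so the sum of all docFreqs equals the total number of unique terms summed over documents. The class and method names are hypothetical and are not part of the patch or of Lucene's API; only the two divisions come from the description.

```java
// Hypothetical helper for illustration only; not Lucene API.
final class LnuStatsSketch {

    /** Average tf within a document: field length divided by its number of unique terms. */
    static double averageTf(int length, int numUniqueTerms) {
        if (numUniqueTerms == 0) {
            return 0.0; // empty field: avoid division by zero
        }
        return (double) length / numUniqueTerms;
    }

    /** Average number of unique terms per document: sum of all terms' docFreqs divided by maxDoc. */
    static double averageUniqueTerms(long sumDocFreq, int maxDoc) {
        if (maxDoc == 0) {
            return 0.0; // empty index: avoid division by zero
        }
        return (double) sumDocFreq / maxDoc;
    }

    public static void main(String[] args) {
        // A field with 120 tokens and 40 distinct terms.
        System.out.println(averageTf(120, 40));
        // An index where docFreqs sum to 9000 over 300 documents.
        System.out.println(averageUniqueTerms(9000L, 300));
    }
}
```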

      1. LUCENE-3290.patch
        28 kB
        Robert Muir
      2. LUCENE-3290.patch
        28 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Patch: I think it's ready to commit, but before committing I want Mike to double-check the unrelated nocommit I added for MemoryCodec.

        Looks like its TermsWriter writes a vLong for sumTotalTermFreq, but its TermsReader reads a vInt... maybe we need a Test2BPostings.
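
        To illustrate why such a mismatch only shows up with very large values, here is a minimal sketch of a 7-bits-per-byte variable-length encoding. It is not MemoryCodec's or Lucene's actual code; the class and methods are assumptions written for this example. Small values round-trip through the int reader unchanged, while a value above Integer.MAX_VALUE comes back as a different number.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only; a generic varint scheme, not Lucene's implementation.
final class VarIntMismatchSketch {

    /** Encodes a long, 7 bits per byte, low-order group first; the high bit marks "more bytes follow". */
    static byte[] writeVLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    /** Decodes into a long, matching writeVLong. */
    static long readVLong(byte[] in) {
        long result = 0;
        int shift = 0;
        for (byte b : in) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return result;
    }

    /** Same loop, but accumulating into an int, so bits beyond 32 cannot be kept. */
    static int readVInt(byte[] in) {
        int result = 0;
        int shift = 0;
        for (byte b : in) {
            result |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long small = 1000L;
        long big = 3_000_000_000L; // larger than Integer.MAX_VALUE
        System.out.println(readVInt(writeVLong(small)) == small); // small values survive
        System.out.println(readVLong(writeVLong(big)) == big);    // long reader round-trips
        System.out.println(readVInt(writeVLong(big)) == big);     // int reader does not
    }
}
```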

        Michael McCandless added a comment -

        You are right – nice catch! Can you change the sumTotalTF to be a readVLong? Thanks.

        Michael McCandless added a comment -

        Patch looks awesome! Nice to add these additional stats.

        Robert Muir added a comment -

        I committed the fix to MemoryCodec, synced the patch up to trunk, and renamed the confusing 'sumDF' variable in TermsConsumer, which is actually not a sumDF at all.

        I think this is ready to go.

        Robert Muir added a comment -

        The FieldInvertState.numUniqueTerms portion is backported to 3.x (no collection level stats are in 3.x in general, seems tricky)

        Yonik Seeley added a comment -

        Is there currently a way to get the number of documents that have a value in the field?
        Then one could compute the average length of a (sparse) field via sumTotalTermFreq(field)/docsWithField(field)
        docsWithField(field) would be useful in other contexts that want to know how sparse a field is (automatically selecting faceting algorithms, etc).
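
        A small sketch of that calculation, with made-up numbers. docsWithField is the statistic being asked about here and is treated as a plain input; the class and method names are hypothetical. Dividing by maxDoc instead would understate the average length of a sparse field.

```java
// Hypothetical helper for illustration only; not Lucene API.
final class SparseFieldLengthSketch {

    /** Average field length: total term occurrences divided by the number of documents counted. */
    static double averageLength(long sumTotalTermFreq, int docCount) {
        if (docCount == 0) {
            return 0.0; // no documents: avoid division by zero
        }
        return (double) sumTotalTermFreq / docCount;
    }

    public static void main(String[] args) {
        long sumTotalTermFreq = 1000L; // tokens in the field across the index
        int maxDoc = 100;              // all documents
        int docsWithField = 10;        // documents that actually have the field
        System.out.println(averageLength(sumTotalTermFreq, maxDoc));
        System.out.println(averageLength(sumTotalTermFreq, docsWithField));
    }
}
```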

        Robert Muir added a comment -

        Not at the moment; we would have to write this separately.

        Uwe Schindler added a comment -

        I reopen this one:

        The FieldInvertState.numUniqueTerms portion is backported to 3.x (no collection level stats are in 3.x in general, seems tricky)

        As we backported this, we must add a Lucene 3.4 backwards index to the TestBackwardsCompatibility test. And hopefully this new 3.4 index format opens successfully in trunk!

        Robert Muir added a comment -

        Uwe, the index format did not change in 3.x!

        Robert Muir added a comment -

        Just more explanation: there are two parts to the patch.

        1. FieldInvertState gets an additional variable, numUniqueTerms. It's not stored anywhere; it just allows you to use it as part of your Similarity.computeNorm calculation, if you like.
        2. In trunk only, we store sumDocFreq, which changes the index format. This is not easy to backport to 3.x, as fields are not clearly separated (which would make it a little tricky), and 3.x is missing new stats like totalTermFreq anyway (because they would bloat TermInfos).
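
        As a rough sketch of the first part, a norm calculation could fold the unique-term count into a pivoted normalization. Everything here is an assumption for illustration: InvertState is a stand-in for the real FieldInvertState, and the pivot and slope parameters and the formula itself are not taken from the patch.

```java
// Illustrative only; a stand-in for a Similarity.computeNorm-style calculation.
final class UniqueTermNormSketch {

    /** Stand-in for the per-field state available at indexing time; not the real FieldInvertState. */
    static final class InvertState {
        final int length;         // number of tokens in the field
        final int numUniqueTerms; // number of distinct terms in the field
        final float boost;        // field boost

        InvertState(int length, int numUniqueTerms, float boost) {
            this.length = length;
            this.numUniqueTerms = numUniqueTerms;
            this.boost = boost;
        }
    }

    /**
     * Pivoted unique-term normalization:
     * boost / ((1 - slope) * pivot + slope * numUniqueTerms).
     * pivot and slope are hypothetical tuning parameters.
     */
    static float computeNorm(InvertState state, float pivot, float slope) {
        return state.boost / ((1 - slope) * pivot + slope * state.numUniqueTerms);
    }

    public static void main(String[] args) {
        InvertState state = new InvertState(100, 50, 1.0f);
        System.out.println(computeNorm(state, 50f, 0.25f));
    }
}
```

        With slope set to 0 the document's own unique-term count drops out and every field gets the same norm; with slope set to 1 the norm is simply boost / numUniqueTerms.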
        Uwe Schindler added a comment -

        OK, sorry for the noise!


          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir