Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6711

Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      SimilarityBase.java has the following line :

       long numberOfDocuments = collectionStats.maxDoc();
      

      It seems like collectionStats.docCount(), which returns the total number of documents that have at least one term for this field, is more appropriate statistics here.

      1. LUCENE-6711.patch
        7 kB
        Ahmet Arslan
      2. LUCENE-6711.patch
        5 kB
        Ahmet Arslan
      3. LUCENE-6711.patch
        1 kB
        Ahmet Arslan
      4. LUCENE-6711.patch
        0.9 kB
        Ahmet Arslan

        Issue Links

          Activity

          Hide
          iorixxx Ahmet Arslan added a comment -

          Patch that includes suggested change. However, this breaks most of the tests in TestSimilarityBase. What is the preferred course of action here?

          Show
          iorixxx Ahmet Arslan added a comment - Patch that includes suggested change. However, this breaks most of the tests in TestSimilarityBase . What is the preferred course of action here?
          Hide
          rcmuir Robert Muir added a comment -

          IndexReader/Terms etc still document this as an optional statistic: I think we should keep it that way. E.G. maybe its hard to compute for some FilterReader, who knows.

          So I think we should do a fallback like the other statistics: check for -1 and use maxDoc if its unsupported.

          But I think its a good time to make the change. For ordinary users, it will not be trappy/happen incrementally: all these statistics have been supported since 4.0. We should fix TFIDFSimilarity and BM25Similarity too.

          Show
          rcmuir Robert Muir added a comment - IndexReader/Terms etc still document this as an optional statistic: I think we should keep it that way. E.G. maybe its hard to compute for some FilterReader, who knows. So I think we should do a fallback like the other statistics: check for -1 and use maxDoc if its unsupported. But I think its a good time to make the change. For ordinary users, it will not be trappy/happen incrementally: all these statistics have been supported since 4.0. We should fix TFIDFSimilarity and BM25Similarity too.
          Hide
          upayavira Upayavira added a comment -

          I've often wondered the same sort of thing.

          Now, given that this will likely change the score for every single query anyone does on any Lucene based search, would it be possible to make this configurable, so that people can choose which one they want? More particularly, to choose the point at which their scoring will change?

          Show
          upayavira Upayavira added a comment - I've often wondered the same sort of thing. Now, given that this will likely change the score for every single query anyone does on any Lucene based search, would it be possible to make this configurable, so that people can choose which one they want? More particularly, to choose the point at which their scoring will change?
          Hide
          rcmuir Robert Muir added a comment -

          This is already a pluggable api so someone can do that if they want: lets not make our code complicated. I also think its best to just make the change for trunk and not do it in a minor version.

          Show
          rcmuir Robert Muir added a comment - This is already a pluggable api so someone can do that if they want: lets not make our code complicated. I also think its best to just make the change for trunk and not do it in a minor version.
          Hide
          iorixxx Ahmet Arslan added a comment -

          This patch checks for -1 and uses maxDoc() if docCount() is not unsupported.

          Show
          iorixxx Ahmet Arslan added a comment - This patch checks for -1 and uses maxDoc() if docCount() is not unsupported.
          Hide
          iorixxx Ahmet Arslan added a comment -

          We should fix TFIDFSimilarity and BM25Similarity too.

          For TFIDF and BM25, do we simply replace

          collectionStats.maxDoc()

          with

          collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount()

          ?

          Show
          iorixxx Ahmet Arslan added a comment - We should fix TFIDFSimilarity and BM25Similarity too. For TFIDF and BM25, do we simply replace collectionStats.maxDoc() with collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount() ?
          Hide
          rcmuir Robert Muir added a comment -

          yes, thats right

          Show
          rcmuir Robert Muir added a comment - yes, thats right
          Hide
          iorixxx Ahmet Arslan added a comment -

          Includes changes to TFIDF and BM25, ant clean test passes.

          Show
          iorixxx Ahmet Arslan added a comment - Includes changes to TFIDF and BM25, ant clean test passes.
          Hide
          hossman Hoss Man added a comment -

          I also think its best to just make the change for trunk and not do it in a minor version.

          +1

          This is the sort of behavior change that should be noted in MIGRATE.txt – Ahmet: could you take a stab at adding the necessary text in your patch?

          Show
          hossman Hoss Man added a comment - I also think its best to just make the change for trunk and not do it in a minor version. +1 This is the sort of behavior change that should be noted in MIGRATE.txt – Ahmet: could you take a stab at adding the necessary text in your patch?
          Hide
          iorixxx Ahmet Arslan added a comment -

          Patch that includes following migrate entry. But I am not sure this is an appropriate text for migrate.txt.

          The way how number of document calculated is changed (LUCENE-6711)

          The number of documents (docCount) is used to calculate term specificity (idf) and average document length (avdl). Prior to LUCENE-6711, collectionStats.maxDoc() was used for the statistics. Now, collectionStats.docCount() is used whenever possible, if not maxDocs() is used.

          Assume that a collection contains 100 documents, and 50 of them have "keywords" field. In this example, maxDocs is 100 while docCount is 50 for the "keywords" field. The total number of tokens for "keywords" field is divided by docCount to obtain avdl. Therefore, docCount which is the total number of documents that have at least one term for the field, is a more precise metric for optional fields.

          DefaultSimilarity does not leverage avdl, so this change would have relatively minor change in the result list. Because relative idf values of terms will remain same. However, when combined with other factors such as term frequency, relative ranking of documents could change. Some Similarity implementations (such as the ones instantiated with NormalizationH2 and BM25) take account into avdl and would have notable change in ranked list. Especially if you have a collection of documents with varying lengths. Because NormalizationH2 tends to punish documents longer than avdl.

          Show
          iorixxx Ahmet Arslan added a comment - Patch that includes following migrate entry. But I am not sure this is an appropriate text for migrate.txt. The way how number of document calculated is changed ( LUCENE-6711 ) The number of documents (docCount) is used to calculate term specificity (idf) and average document length (avdl). Prior to LUCENE-6711 , collectionStats.maxDoc() was used for the statistics. Now, collectionStats.docCount() is used whenever possible, if not maxDocs() is used. Assume that a collection contains 100 documents, and 50 of them have "keywords" field. In this example, maxDocs is 100 while docCount is 50 for the "keywords" field. The total number of tokens for "keywords" field is divided by docCount to obtain avdl. Therefore, docCount which is the total number of documents that have at least one term for the field, is a more precise metric for optional fields. DefaultSimilarity does not leverage avdl, so this change would have relatively minor change in the result list. Because relative idf values of terms will remain same. However, when combined with other factors such as term frequency, relative ranking of documents could change. Some Similarity implementations (such as the ones instantiated with NormalizationH2 and BM25) take account into avdl and would have notable change in ranked list. Especially if you have a collection of documents with varying lengths. Because NormalizationH2 tends to punish documents longer than avdl.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1695744 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1695744 ]

          LUCENE-6711: Use CollectionStatistics.docCount() for IDF and average field length computations

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1695744 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1695744 ] LUCENE-6711 : Use CollectionStatistics.docCount() for IDF and average field length computations
          Hide
          rcmuir Robert Muir added a comment -

          Thanks Ahmet, I committed this. I only made cosmetic changes: I renamed local variable and parameter names to be "docCount" because numDocs is pretty confusing. I also added a test case for all of our similarities.

          Show
          rcmuir Robert Muir added a comment - Thanks Ahmet, I committed this. I only made cosmetic changes: I renamed local variable and parameter names to be "docCount" because numDocs is pretty confusing. I also added a test case for all of our similarities.
          Hide
          steve_rowe Steve Rowe added a comment -
          Show
          steve_rowe Steve Rowe added a comment - testNoFieldSkew() failure: https://issues.apache.org/jira/browse/LUCENE-6751
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1696807 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1696807 ]

          LUCENE-6711: improve test when it fails

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1696807 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1696807 ] LUCENE-6711 : improve test when it fails
          Hide
          rcmuir Robert Muir added a comment -

          Thanks. I think it found an unrelated bug.

          Show
          rcmuir Robert Muir added a comment - Thanks. I think it found an unrelated bug.
          Hide
          hossman Hoss Man added a comment -

          possible bug identified by Terry Smith in LUCENE-6758

          Show
          hossman Hoss Man added a comment - possible bug identified by Terry Smith in LUCENE-6758
          Hide
          rcmuir Robert Muir added a comment -

          I dont think its a bug with this.

          Likely the typical bugs from crappy useless querynorm, and exposed by shaking things up.

          Show
          rcmuir Robert Muir added a comment - I dont think its a bug with this. Likely the typical bugs from crappy useless querynorm, and exposed by shaking things up.

            People

            • Assignee:
              rcmuir Robert Muir
              Reporter:
              iorixxx Ahmet Arslan
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development