Lucene - Core
LUCENE-6818

Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

    Details

    • Lucene Fields:
      New, Patch Available
    • Flags:
      Patch

      Description

      As explained in the write-up, implementations of many state-of-the-art ranking models have been added to Apache Lucene.

      This issue aims to add the DFI model, which is the non-parametric counterpart of the Divergence from Randomness (DFR) framework.

      DFI is both parameter-free and non-parametric:

      • parameter-free: it does not require any parameter tuning or training.
      • non-parametric: it does not make any assumptions about word frequency distributions on document collections.

      It is highly recommended not to remove stopwords (very common terms: the, of, and, to, a, in, for, is, on, that, etc.) when using this similarity.

      For more information see: A nonparametric term weighting method for information retrieval based on measuring the divergence from independence
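
      To make the model concrete, the following is a rough, illustrative sketch of the scoring idea, assuming a standardized (chi-square based) divergence measure; the exact statistics and measure used in the attached patches may differ, and dfiScore and its parameters are illustrative names only.

          // Illustrative sketch of DFI scoring (not the committed DFISimilarity).
          // The observed within-document frequency is compared to the frequency
          // expected if the term and the document were independent; only the part
          // above that expectation contributes to the score.
          float dfiScore(float freq, long totalTermFreq, float docLen, long numberOfFieldTokens) {
            // expected frequency of the term in this document under independence
            float expected = (float) totalTermFreq * docLen / numberOfFieldTokens;
            if (freq <= expected) {
              return 0f; // term occurs no more often than chance would predict
            }
            // standardized divergence from independence
            float measure = (freq - expected) / (float) Math.sqrt(expected);
            return (float) (Math.log(measure + 1) / Math.log(2)); // log2(measure + 1)
          }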

      1. LUCENE-6818.patch
        21 kB
        Ahmet Arslan
      2. LUCENE-6818.patch
        21 kB
        Ahmet Arslan
      3. LUCENE-6818.patch
        23 kB
        Ahmet Arslan
      4. LUCENE-6818.patch
        21 kB
        Ahmet Arslan
      5. LUCENE-6818.patch
        19 kB
        Ahmet Arslan

          Activity

          Ahmet Arslan added a comment -

          Patch for DFI. However, with this one, TestSimilarity2#testCrazySpans fails.
          Any pointers on how to fix this would be really appreciated.

          Robert Muir added a comment -

          It happens when expected = 0, caused by the craziness of how spans score (they will happily score a term that does not exist). In this case, totalTermFreq is zero, which makes expected go to zero, and then later the formula produces infinity (which the test checks for).

          The test has this explanation for how spans score terms that don't exist:

              // The problem: "normal" lucene queries create scorers, returning null if terms dont exist
              // This means they never score a term that does not exist.
              // however with spans, there is only one scorer for the whole hierarchy:
              // inner queries are not real queries, their boosts are ignored, etc.
          

          The typical solution is to do something like adjust expected:

              final float expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
          

          I have not read the paper, but these are things to deal with when integrating into Lucene. Another thing to be careful about is ensuring that the integration with Lucene's boosting is really safe: index-time boosts work on the norm by making the document appear shorter or longer, so docLen might have a "crazy" value if the user does this.
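
          (For illustration, in a SimilarityBase subclass that adjusted "expected" would live inside the score(BasicStats, freq, docLen) override, roughly as sketched below; the divergence measure shown is only a placeholder, not necessarily what the patch computes.)

              @Override
              protected float score(BasicStats stats, float freq, float docLen) {
                // +1 on both collection statistics keeps "expected" strictly positive,
                // even when spans score a term whose totalTermFreq is 0, so the final
                // value cannot blow up to Infinity
                final float expected = (1 + stats.getTotalTermFreq()) * docLen
                    / (1 + stats.getNumberOfFieldTokens());
                if (freq <= expected) {
                  return 0f; // no evidence above the independence expectation
                }
                // placeholder divergence measure: standardized difference
                final float measure = (freq - expected) / (float) Math.sqrt(expected);
                return (float) (Math.log(measure + 1) / Math.log(2)); // log2(measure + 1)
              }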

          Ahmet Arslan added a comment -

          This patch prevents infinite scores by using the +1 trick. Now TestSimilarity2#testCrazySpans passes.

          Ahmet Arslan added a comment -

          The typical solution is to do something like adjust expected:

          Thanks Robert for the suggestion and explanation. I used the typical solution; it's working now.

          I have not read the paper, but these are things to deal with when integrating into Lucene.

          For your information, if you want to take a look, the Terrier 4.0 source tree has this model in DFIC.java.

          index-time boosts work on the norm by making the document appear shorter or longer, so docLen might have a "crazy" value if the user does this.

          I was relying on o.a.l.search.similarities.SimilarityBase for this, but it looks like all of its subclasses (DFR, IB) have this problem. I included a TestSimilarityBase#testNorms method in the new patch to demonstrate the problem. If I am not missing something obvious, this is a bug, no?

          Ahmet Arslan added a comment -
          • renamed failing test to TestSimilarityBase#testIndexTimeBoost
          • randomized the test method a bit
          Robert Muir added a comment -

          It is not a bug; it is just how index-time boosting in Lucene has always worked. Boosting a document at index time is just a way for a user to make it artificially longer or shorter.

          I don't think we should change this; it makes it much easier for people to experiment, since all of our scoring models do this the same way. It means you do not have to reindex to change the Similarity, for example.

          It's easy to understand this as "at search time, the similarity sees the 'normalized' document length". All I am saying is that these scoring models just have to make sure they don't do something totally nuts (like return negative, Infinity, or NaN scores) if the user index-time boosts with extreme values: values that might not make sense relative to, e.g., the collection-level statistics for the field. So in my opinion, all that is needed is to add a `testCrazyBoosts` that looks a lot like `testCrazySpans` and just asserts those things, ideally across all 256 possible norm values.
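
          (Roughly, the shape of such a check, with illustrative names: "similarity" and "stats" stand for fixtures the real test would set up, decodedDocLen is a hypothetical decoder from a norm byte to a document length, and the actual test would index documents and go through IndexSearcher, like testCrazySpans does.)

              // iterate every possible norm byte and a handful of tf values,
              // asserting the similarity never produces NaN, Infinity, or a
              // negative score
              for (int norm = 0; norm < 256; norm++) {
                float docLen = decodedDocLen(norm); // hypothetical norm -> length decoder
                for (float freq = 1; freq <= 10; freq++) {
                  float score = similarity.score(stats, freq, docLen);
                  assertFalse("NaN score", Float.isNaN(score));
                  assertFalse("infinite score", Float.isInfinite(score));
                  assertTrue("negative score", score >= 0);
                }
              }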

          Ahmet Arslan added a comment -

          I tried to implement Robert's suggestion in TestSimilarityBase#testCrazyIndexTimeBoosts.
          It iterates over all possible norm values and 10 different term frequency (tf) values. NaN, Infinity, and negative values are checked, but I am not sure about the negative check: some models can return negative scores for certain terms. For example, BM25 returns negative scores for common terms.

          Currently only DFI is tested, because the other models fail the test in its current form.

          A random question:

          What is the preferred course of action during scoring when the term frequency is greater than the document length?

          I think we should simply recommend using index-time boosts only with ClassicSimilarity. I wonder how SweetSpotSimilarity works with index-time boosts, where artificially shortening the document length may decrease its rank.

          Adrien Grand added a comment -

          I think we should simply recommend using index-time boosts only with ClassicSimilarity.

          If we can only recommend using index-time boosts with a Similarity that is not even the default one, maybe we should remove index-time boosts entirely? I opened https://issues.apache.org/jira/browse/LUCENE-6819

          Ahmet Arslan added a comment -

          Patch updated to current trunk (revision 1713433)

          ASF subversion and git services added a comment -

          Commit 1725205 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1725205 ]

          LUCENE-6818: Add DFISimilarity implementing the divergence from independence model

          Robert Muir added a comment -

          Thanks Ahmet Arslan!

          The norms/spans tests were added in LUCENE-6896.

          Rather than use a wildcard import, I moved RandomSimilarityProvider to similarities/RandomSimilarity, so it's in the correct package. It's just used by LuceneTestCase.newSearcher.

          I ran the test suite a few times to try to find any problems and did some rudimentary relevance testing of the Lucene impl; everything seems OK.

          For the Solr factory changes around discountOverlaps, can you make a separate issue for that? I'm concerned that, if the factory is not initialized properly, there will be other problems instead, so maybe that should really be an assertion or something.
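
          (For illustration only, a hypothetical sketch of what a Solr factory with such an assertion could look like; the class and field names are illustrative, it assumes the no-argument DFISimilarity constructor from this patch, and it is not the committed SOLR-8570 code.)

              import org.apache.lucene.search.similarities.DFISimilarity;
              import org.apache.lucene.search.similarities.Similarity;
              import org.apache.solr.common.params.SolrParams;
              import org.apache.solr.schema.SimilarityFactory;

              public class DFISimilarityFactory extends SimilarityFactory {
                private Boolean discountOverlaps; // null until init() has run

                @Override
                public void init(SolrParams params) {
                  super.init(params);
                  discountOverlaps = params.getBool("discountOverlaps", true);
                }

                @Override
                public Similarity getSimilarity() {
                  // fail loudly if the factory was never initialized, instead of
                  // silently falling back to a default
                  if (discountOverlaps == null) {
                    throw new IllegalStateException("init(SolrParams) was not called");
                  }
                  DFISimilarity sim = new DFISimilarity();
                  sim.setDiscountOverlaps(discountOverlaps);
                  return sim;
                }
              }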

          ASF subversion and git services added a comment -

          Commit 1725210 from Robert Muir in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1725210 ]

          LUCENE-6818: Add DFISimilarity implementing the divergence from independence model

          Ahmet Arslan added a comment -

          Thanks Robert Muir for taking care of this.

          For the solr factory changes around discountOverlaps, can you make a separate issue for that?

          Created SOLR-8570


            People

            • Assignee:
              Robert Muir
              Reporter:
              Ahmet Arslan
             • Votes:
               0
               Watchers:
               6
