Lucene - Core: LUCENE-6896

Fix/document various Similarity bugs around extreme norm values

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from LUCENE-6818:

      Ahmet Arslan found problems with every Similarity (except ClassicSimilarity) when trying to test how they behave on every possible norm value, to ensure they are robust for all index-time boosts.

      There are several problems:
      1. A buggy normalization decode causes the smallest possible norm value (0) to be treated as an infinitely long document. Decoded lengths are intended to be non-negative finite values; an infinite length breaks everything downstream.
      2. Various problems in the less practical functions, which already carry documented warnings that they misbehave for extreme values. These affect DFR models D, Be, and P, and IB distribution SPL.

        Activity

        Robert Muir added a comment -

        Patch:

        • adds tests that normalization decode is well behaved: only finite non-negative values, with increasing boost / decreasing length.
        • fixes normalization decode for BM25Similarity and SimilarityBase to adhere to this, by setting len[0] = 1 / len[255] (5.6493154E19) instead of going to infinity.
        • adds a delta to DFR model P to avoid going infinite when normalized tf approaches zero.
        • adds/elaborates on warnings for the other 3 models and adds TODOs: maybe there is a similar simple fix for those.
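
The decode fix and the accompanying check can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the committed patch: the class and method names (NormTableSketch, buildTable, isWellBehaved) are hypothetical, and byte315ToFloat is re-implemented here on the assumption that it matches the 3-bit-mantissa decoding that Lucene's SmallFloat provides.

```java
// Sketch only. All names are hypothetical; byte315ToFloat is an assumed
// re-implementation of the SmallFloat decoding, not the Lucene class itself.
final class NormTableSketch {

    /** Decodes a norm byte (3 mantissa bits, 5 exponent bits, zero exponent 15). */
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /**
     * Builds the table of decoded lengths, 1 / (f * f) per norm byte.
     * With fixed == false, entry 0 is 1 / (0 * 0), which is Infinity.
     * With fixed == true, entry 0 is set to 1 / table[255] as described above.
     */
    static float[] buildTable(boolean fixed) {
        float[] table = new float[256];
        for (int i = 0; i < 256; i++) {
            float f = byte315ToFloat((byte) i);
            table[i] = 1.0f / (f * f);
        }
        if (fixed) {
            table[0] = 1.0f / table[255]; // otherwise infinite
        }
        return table;
    }

    /** Checks: every value finite and non-negative, and length decreasing as the norm byte grows. */
    static boolean isWellBehaved(float[] table) {
        for (int i = 0; i < table.length; i++) {
            if (!Float.isFinite(table[i]) || table[i] < 0.0f) {
                return false;
            }
            if (i > 0 && table[i] >= table[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("unfixed len[0] = " + buildTable(false)[0]);
        System.out.println("fixed   len[0] = " + buildTable(true)[0]);
        System.out.println("fixed table well behaved: " + isWellBehaved(buildTable(true)));
    }
}
```

Running the check against both tables shows the difference: the unfixed table fails on entry 0, while the fixed table keeps every decoded length finite and ordered.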
        Michael McCandless added a comment -

        +1 to not divide by 0!

        Adrien Grand added a comment -

        +1

        I'm curious what the reasoning is for

        NORM_TABLE[0] = 1.0f / NORM_TABLE[255];

        Is it just a way to get a high float value that would be unlikely to overflow to Infinity (e.g. when multiplied), or is it more than that?

        Robert Muir added a comment -

        It's completely arbitrary. But setting the largest value to the inverse of the smallest value does not seem too surprising.

        Adrien Grand added a comment -

        That works for me, thanks for the explanation.

        ASF subversion and git services added a comment -

        Commit 1725178 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1725178 ]

        LUCENE-6896: don't treat smallest possible norm value as an infinitely long doc in SimilarityBase or BM25Similarity

        ASF subversion and git services added a comment -

        Commit 1725181 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1725181 ]

        LUCENE-6896: don't treat smallest possible norm value as an infinitely long doc in SimilarityBase or BM25Similarity

        Robert Muir added a comment -

        I committed the fixes and tests. I did discard my changes (delta) to model P after some investigation, as it does not fix all P's problems with abnormal TF values.
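
For context on why model P is fragile for abnormal TF values, here is a small sketch. The formula is an assumption: it is the Poisson (Stirling) approximation commonly given for DFR model P, written from memory and not taken from this issue, and the names ModelPSketch, score, tfn, and lambda are hypothetical. It does not reproduce the discarded delta, whose exact form is not stated here.

```java
// Sketch only. The formula is an assumed form of DFR model P (Stirling
// approximation of the Poisson model); names and parameters are hypothetical.
final class ModelPSketch {

    private static final double LOG2_E = 1.0 / Math.log(2.0);

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * tfn: normalized term frequency; lambda: expected term frequency per document.
     * Both the 1 / (12 * tfn) term and log2(tfn) misbehave as tfn approaches zero.
     */
    static float score(double tfn, double lambda) {
        return (float) (tfn * log2(tfn / lambda)
                + (lambda + 1.0 / (12.0 * tfn) - tfn) * LOG2_E
                + 0.5 * log2(2.0 * Math.PI * tfn));
    }

    public static void main(String[] args) {
        double lambda = 0.5;
        System.out.println("tfn = 1.0             -> " + score(1.0, lambda));
        System.out.println("tfn = Float.MIN_VALUE -> " + score(Float.MIN_VALUE, lambda));
        System.out.println("tfn = 0.0             -> " + score(0.0, lambda));
    }
}
```

Under these assumptions a tiny tfn overflows the float result and a zero tfn produces NaN, so there is more than one failure mode to guard against.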


          People

          • Assignee: Unassigned
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 4
