Lucene - Core
  1. Lucene - Core
  2. LUCENE-1420

Similarity.lengthNorm and positionIncrement=0

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Calculation of lengthNorm factor should in some cases take into account the number of tokens with positionIncrement=0. This should be made optional, to support two different scenarios:

      • when analyzers insert artificially constructed tokens into TokenStream (e.g. ASCII-fied versions of accented terms, stemmed terms), and it's unlikely that users submit queries containing both versions of tokens: in this case lengthNorm calculation should ignore the tokens with positionIncrement=0.
      • when analyzers insert synonyms, and it's likely that users may submit queries that contain multiple synonymous terms: in this case the lengthNorm should be calculated as it is now, i.e. it should take into account all terms no matter what is their positionIncrement.

      The default should be backward-compatible, i.e. it should count all tokens.

      (See also the discussion here: http://markmail.org/message/vfvmzrzhr6pya22h )

      1. LUCENE-1420.patch
        24 kB
        Michael McCandless
      2. similarity.patch
        15 kB
        Andrzej Bialecki
      3. similarity-v2.patch
        15 kB
        Andrzej Bialecki

        Activity

        Andrzej Bialecki created issue -
        Hide
        Andrzej Bialecki added a comment -

        This patch adds Similarity,length(fieldName, numTokens, numOverlapTokens), and provides backward-compatible implementation in Similarity.java.

        Show
        Andrzej Bialecki added a comment - This patch adds Similarity,length(fieldName, numTokens, numOverlapTokens), and provides backward-compatible implementation in Similarity.java.
        Andrzej Bialecki made changes -
        Field Original Value New Value
        Attachment similarity.patch [ 12392011 ]
        Michael McCandless made changes -
        Assignee Michael McCandless [ mikemccand ]
        Hide
        Michael McCandless added a comment -

        Looks good, thanks Andrzej. I plan to commit in a day or two.

        Show
        Michael McCandless added a comment - Looks good, thanks Andrzej. I plan to commit in a day or two.
        Hide
        Hoss Man added a comment -

        1) i only skimmed this quickly, but i don't think the changes to SweetSpotSimilarity are back compatible ... setLengthNormFactors has a new arg list.

        2) ditto for the public "Info" constructor in MemoryIndex.java

        3) as long as we are adding a new lengthNorm method that has access to new data about the stream, would it also make sense to pass in fieldState.position? and/or a new count of hte number of times getPositionIncrementGap(fieldInfo.name) is called? Those also seem like they could be useful, and should be just as cheap to keep track of as numOverlap and length. (this occured to me because of recent threads on solr-user asking about lengthNorm and multivalued fields ... there may only be one fieldNorm per field name, but with stats like that we could at least do some interesting things based on the average length of each field value.

        4) independent of #3, we may want to consider making FieldInvertState a public class and passing it directly to lengthNorm ... that way lengthNorm can utilize whatever data it wants, and we can add more available data later without changing the API again. We could even deprecate lengthNorm entirely and add a new FieldInvertState.norm property that a new Similarity.computeNorm(FieldInvertState) could set directly so it could choose to ignore the doc & field boosts altogether if it wanted to.

        Show
        Hoss Man added a comment - 1) i only skimmed this quickly, but i don't think the changes to SweetSpotSimilarity are back compatible ... setLengthNormFactors has a new arg list. 2) ditto for the public "Info" constructor in MemoryIndex.java 3) as long as we are adding a new lengthNorm method that has access to new data about the stream, would it also make sense to pass in fieldState.position? and/or a new count of hte number of times getPositionIncrementGap(fieldInfo.name) is called? Those also seem like they could be useful, and should be just as cheap to keep track of as numOverlap and length. (this occured to me because of recent threads on solr-user asking about lengthNorm and multivalued fields ... there may only be one fieldNorm per field name, but with stats like that we could at least do some interesting things based on the average length of each field value. 4) independent of #3, we may want to consider making FieldInvertState a public class and passing it directly to lengthNorm ... that way lengthNorm can utilize whatever data it wants, and we can add more available data later without changing the API again. We could even deprecate lengthNorm entirely and add a new FieldInvertState.norm property that a new Similarity.computeNorm(FieldInvertState) could set directly so it could choose to ignore the doc & field boosts altogether if it wanted to.
        Hide
        Michael McCandless added a comment -

        Re 1 & 2: I agree. Even though the back compat is "looser" for contribs, we still should not break it unless we have to. In this case we definitely don't have to.

        Re 3 & 4: I like this in principle, but I'm nervous about making FieldInvertState public since it's such a new API. I guess if we put the "this API is new & subject to change" caveat on there, to reserve the right to change it, that's OK. I also like having Similiarity.computeNorm factor in the boost, and extending FieldInverState to track the number of field occurrences that had been processed.

        Andrzej, can you update the patch to address these issues?

        Show
        Michael McCandless added a comment - Re 1 & 2: I agree. Even though the back compat is "looser" for contribs, we still should not break it unless we have to. In this case we definitely don't have to. Re 3 & 4: I like this in principle, but I'm nervous about making FieldInvertState public since it's such a new API. I guess if we put the "this API is new & subject to change" caveat on there, to reserve the right to change it, that's OK. I also like having Similiarity.computeNorm factor in the boost, and extending FieldInverState to track the number of field occurrences that had been processed. Andrzej, can you update the patch to address these issues?
        Hide
        Michael McCandless added a comment -

        Andrzej are you [going to] work out a new patch here? I'd like to bring closure.

        Show
        Michael McCandless added a comment - Andrzej are you [going to] work out a new patch here? I'd like to bring closure.
        Hide
        Andrzej Bialecki added a comment -

        Patch that uses FieldInvertState, as suggested. Changes in other classes are limited to minimum in this patch - users of Similarity other than o.a.l.index.* still use lengthNorm(String, int).

        Show
        Andrzej Bialecki added a comment - Patch that uses FieldInvertState, as suggested. Changes in other classes are limited to minimum in this patch - users of Similarity other than o.a.l.index.* still use lengthNorm(String, int).
        Andrzej Bialecki made changes -
        Attachment similarity-v2.patch [ 12392830 ]
        Hide
        Michael McCandless added a comment -

        Thanks Andrzej.

        I made some further changes (attached new patch):

        • Changed the new Similarity.computeNorm(String, FieldInvertState)
          method to return the float norm instead of setting it on the
          FieldInvertState. I also removed FieldInvertState. {get,set}

          Norm.

        • I put back the changes to SweetSpotSimilarity & MemoryIndex, but
          kept the old APIs (deprecated) for back compat.
        • I added {set/get}

          DiscountOverlaps to DefaultSimilarity, and also
          overrode computeNorm to use discountOverlaps to compute the number
          of tokens to pass to lengthNorm.

        • The Info class is private inside MemoryIndex, even though its ctor
          is public, so, I just added boolean to its ctor (ie, there's no
          back-compat issue).
        • Tweaked javadocs

        So now, both DefaultSimilarity and SweetSpotSimilarity have the option
        to discount overlaps, but do not do so by default.

        Show
        Michael McCandless added a comment - Thanks Andrzej. I made some further changes (attached new patch): Changed the new Similarity.computeNorm(String, FieldInvertState) method to return the float norm instead of setting it on the FieldInvertState. I also removed FieldInvertState. {get,set} Norm. I put back the changes to SweetSpotSimilarity & MemoryIndex, but kept the old APIs (deprecated) for back compat. I added {set/get} DiscountOverlaps to DefaultSimilarity, and also overrode computeNorm to use discountOverlaps to compute the number of tokens to pass to lengthNorm. The Info class is private inside MemoryIndex, even though its ctor is public, so, I just added boolean to its ctor (ie, there's no back-compat issue). Tweaked javadocs So now, both DefaultSimilarity and SweetSpotSimilarity have the option to discount overlaps, but do not do so by default.
        Michael McCandless made changes -
        Attachment LUCENE-1420.patch [ 12392932 ]
        Hide
        Andrzej Bialecki added a comment -

        Thanks for a thorough review. I'm fine with these changes, and the idea to gently introduce discountOverlaps in the Similarity API is great. Please go ahead if there are no further objections.

        Show
        Andrzej Bialecki added a comment - Thanks for a thorough review. I'm fine with these changes, and the idea to gently introduce discountOverlaps in the Similarity API is great. Please go ahead if there are no further objections.
        Hide
        Michael McCandless added a comment -

        OK thanks Andrzej. I plan to commit in a day or two.

        Show
        Michael McCandless added a comment - OK thanks Andrzej. I plan to commit in a day or two.
        Hide
        Michael McCandless added a comment -

        Committed revision 710117. Thanks Andrzej!

        Show
        Michael McCandless added a comment - Committed revision 710117. Thanks Andrzej!
        Michael McCandless made changes -
        Fix Version/s 2.3.3 [ 12313327 ]
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Mark Miller made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Mark Thomas made changes -
        Workflow jira [ 12444000 ] Default workflow, editable Closed status [ 12562813 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12562813 ] jira [ 12583725 ]
        Grant Ingersoll made changes -
        Affects Version/s 2.3.3 [ 12313327 ]

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Andrzej Bialecki
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development