Lucene - Core
LUCENE-1420: Similarity.lengthNorm and positionIncrement=0

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

    Description

      The calculation of the lengthNorm factor should in some cases take into account the number of tokens with positionIncrement=0. This should be made optional, to support two different scenarios:

      • when analyzers insert artificially constructed tokens into the TokenStream (e.g. ASCII-fied versions of accented terms, or stemmed terms), and it is unlikely that users will submit queries containing both versions of a token: in this case the lengthNorm calculation should ignore the tokens with positionIncrement=0.
      • when analyzers insert synonyms, and it is likely that users will submit queries containing multiple synonymous terms: in this case lengthNorm should be calculated as it is now, i.e. it should take into account all terms regardless of their positionIncrement.

      The default should be backward-compatible, i.e. it should count all tokens (a sketch of both options follows below).

      (See also the discussion here: http://markmail.org/message/vfvmzrzhr6pya22h )
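
      To make the two scenarios concrete, here is a minimal sketch of a Similarity that can optionally discount tokens with positionIncrement=0 ("overlap" tokens) when computing the length norm. It assumes the Similarity.computeNorm(String, FieldInvertState) extension point and FieldInvertState.getNumOverlap() available in Lucene 2.9; the class name and constructor flag are illustrative only, not the committed API.

          import org.apache.lucene.index.FieldInvertState;
          import org.apache.lucene.search.DefaultSimilarity;

          // Illustrative sketch only: the class name and constructor flag are invented here.
          public class OverlapAwareSimilarity extends DefaultSimilarity {

            private final boolean discountOverlaps;

            public OverlapAwareSimilarity(boolean discountOverlaps) {
              this.discountOverlaps = discountOverlaps;
            }

            @Override
            public float computeNorm(String field, FieldInvertState state) {
              // getLength() counts every indexed token; getNumOverlap() counts the
              // tokens that were indexed with positionIncrement=0 (stacked tokens).
              int numTokens = state.getLength();
              if (discountOverlaps) {
                numTokens -= state.getNumOverlap();   // scenario 1: ignore stacked tokens
              }
              // With discountOverlaps=false (scenario 2) every token is counted,
              // matching the current, backward-compatible behavior.
              return state.getBoost() * lengthNorm(field, numTokens);
            }
          }

      An application would install such a Similarity on its IndexWriter (e.g. via IndexWriter.setSimilarity(...)): with discountOverlaps=true the artificially stacked variants no longer shorten the field's norm, while false preserves today's behavior.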

      Attachments

        1. LUCENE-1420.patch (24 kB, Michael McCandless)
        2. similarity.patch (15 kB, Andrzej Bialecki)
        3. similarity-v2.patch (15 kB, Andrzej Bialecki)


          People

            Assignee: Michael McCandless (mikemccand)
            Reporter: Andrzej Bialecki (ab)
            Votes: 0
            Watchers: 0
