Lucene - Core / LUCENE-2286

enable DefaultSimilarity.setDiscountOverlaps by default

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think we should enable setDiscountOverlaps in DefaultSimilarity by default.

      If you are using synonyms, commongrams, or any of a number of other methods that inject terms with a position increment of zero, these currently screw up your length normalization.
      These terms have a position increment of zero, so they shouldn't count towards the length of the document.

      I've done relevance tests with Persian showing the difference is significant, and I think it's a big trap for anyone using synonyms, etc.: your relevance can actually get worse if you don't flip this boolean flag.
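
      To make the effect on length normalization concrete, here is a minimal, self-contained sketch. It does not use Lucene's classes: the class name OverlapNormSketch and both method names are hypothetical, and the 1/sqrt(length) formula is assumed here as the classic length norm purely for illustration.

```java
// Illustrative sketch only: mirrors the idea behind discountOverlaps,
// not Lucene's actual classes. All names here are hypothetical.
final class OverlapNormSketch {

    /** Counts tokens, optionally skipping those with a position increment of 0. */
    static int fieldLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int posInc : positionIncrements) {
            if (discountOverlaps && posInc == 0) {
                continue; // injected term stacked on the previous token's position
            }
            length++;
        }
        return length;
    }

    /** Assumed 1/sqrt(length) normalization: a longer field yields a smaller norm. */
    static double lengthNorm(int[] positionIncrements, boolean discountOverlaps) {
        return 1.0 / Math.sqrt(fieldLength(positionIncrements, discountOverlaps));
    }

    public static void main(String[] args) {
        // "fast car" with injected synonyms "quick" and "automobile" at posInc 0
        int[] posIncs = {1, 0, 1, 0};
        System.out.println("overlaps counted:    " + lengthNorm(posIncs, false));
        System.out.println("overlaps discounted: " + lengthNorm(posIncs, true));
    }
}
```

      In this toy case the two-word field is treated as four terms long when overlaps are counted, so it gets a lower norm than an otherwise identical field without synonyms. In Lucene itself the behavior is controlled by DefaultSimilarity.setDiscountOverlaps, as named in the issue title.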

        Activity

        Robert Muir added a comment -

        Attached is a patch, with the backwards break noted in CHANGES.

        Michael McCandless added a comment -

        +1

        Michael McCandless added a comment -

        Patch looks good (trivial).

        Robert Muir added a comment -

        OK, I will commit in a few days if no one objects. In my opinion the backwards break is the easiest way to go.

        In practice it won't hurt existing docs, and if someone is concerned about bad ranking (because the newly indexed docs suddenly are ranked better), they can turn this off with the boolean until they get a chance to reindex all docs.

        Robert Muir added a comment -

        Committed revision 917148.

        Koji Sekiguchi added a comment -

        According to CHANGES.txt, this fix is in branch_3x as well.

        Grant Ingersoll added a comment -

        Bulk close for 3.1


          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir