
change IndexSearcher default similarity to BM25

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Since Lucene 4.0, the statistics needed for this are always present, so we can make the change without any degradation.

      I think the change should be a 6.0 change only: it will prevent any surprises. DefaultSimilarity is renamed to ClassicSimilarity to prevent confusion. No indexing change is needed as we use the same norm format; it's just a runtime switch. Users can just do IndexSearcher.setSimilarity(new ClassicSimilarity()) to get the old behavior. I did not change Solr's default here; I think that should be a separate issue, since it has more concerns: e.g. factories in configuration files and so on.
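
      For readers unfamiliar with BM25, here is a minimal sketch of the textbook per-term score, to show which index statistics it consumes (document frequency, document count, and the sum of term frequencies used for the average field length). It is not Lucene's BM25Similarity source: the class and method names are hypothetical, the document length is passed in exactly rather than decoded from a norm, and only the k1 = 1.2 and b = 0.75 values match Lucene's documented defaults.

```java
/** Textbook BM25 term score. An illustrative sketch, not Lucene's implementation. */
final class Bm25Sketch {
    static final double K1 = 1.2;  // controls term-frequency saturation
    static final double B = 0.75;  // controls how strongly length is normalized

    /** Inverse document frequency: rarer terms get a larger weight. */
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Score of one term in one document.
     * sumTotalTermFreq / docCount gives the average field length, which is
     * the collection-level statistic the classic TF-IDF model did not need.
     */
    static double score(double freq, long docLen, long docFreq,
                        long docCount, long sumTotalTermFreq) {
        double avgFieldLen = (double) sumTotalTermFreq / docCount;
        double lengthNorm = K1 * (1.0 - B + B * docLen / avgFieldLen);
        return idf(docFreq, docCount) * freq * (K1 + 1.0) / (freq + lengthNorm);
    }

    public static void main(String[] args) {
        // Hypothetical collection: 1000 docs, 100000 tokens, term occurs in 10 docs.
        System.out.println("freq=1, average-length doc: " + score(1, 100, 10, 1000, 100_000));
        System.out.println("freq=5, average-length doc: " + score(5, 100, 10, 1000, 100_000));
        System.out.println("freq=1, long doc:           " + score(1, 400, 10, 1000, 100_000));
    }
}
```

      The point of the sketch is the shape of the function: the score grows with term frequency but saturates, and longer-than-average documents are penalized, which is why the average field length has to be available at search time.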

      One issue was the generation of synonym queries (posinc=0) by QueryBuilder (used by parsers). This is kind of a corner case (query-time synonyms), but we should make it nicer. The current code in trunk disables coord, which makes no sense for anything but the vector space impl. Instead, this patch adds a SynonymQuery which treats occurrences of any term as a single pseudoterm. With English WordNet as a query-time synonym dict, this query gives 12% improvement in MAP for title queries on BM25, and 2% with Classic (not significant). So it's a better generic approach for synonyms that works with all scoring models.

      I wanted to use BlendedTermQuery, but at a glance it seems to have problems: it tries to "take on the world", and it does not work with distributed scoring (it doesn't consult IndexSearcher for stats). Anyway, this one is a different, simpler approach, which only works for a single field, and which calls tf(sum) a single time.
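
      The "single pseudoterm" idea can be illustrated with a small sketch. It is not the SynonymQuery code from the patch: the class, the method names, and the saturating tf function are all hypothetical, and index-level statistics (idf) are left out entirely. It only contrasts calling tf once on the summed frequency with scoring each synonym as its own clause and adding the results.

```java
import java.util.List;
import java.util.Map;

/** Contrast tf(sum of freqs) with sum of tf(freq). Illustrative sketch only. */
final class SynonymScoreSketch {
    static final double K1 = 1.2;  // assumed saturation constant for the toy tf

    /** A saturating, BM25-style tf; any non-linear tf shows the same effect. */
    static double tf(double freq) {
        return freq * (K1 + 1.0) / (freq + K1);
    }

    /** Occurrences of any synonym in the document count toward one pseudoterm. */
    static int pseudoTermFreq(Map<String, Integer> docTermFreqs, List<String> synonyms) {
        int sum = 0;
        for (String term : synonyms) {
            sum += docTermFreqs.getOrDefault(term, 0);
        }
        return sum;
    }

    /** Pseudoterm approach: tf is evaluated a single time, on the summed frequency. */
    static double pseudoTermScore(Map<String, Integer> docTermFreqs, List<String> synonyms) {
        return tf(pseudoTermFreq(docTermFreqs, synonyms));
    }

    /** Disjunction approach: each matching synonym is scored separately, then added. */
    static double disjunctionScore(Map<String, Integer> docTermFreqs, List<String> synonyms) {
        double total = 0.0;
        for (String term : synonyms) {
            int freq = docTermFreqs.getOrDefault(term, 0);
            if (freq > 0) {
                total += tf(freq);
            }
        }
        return total;
    }
}
```

      With a saturating tf, the disjunction rewards a document for using two different synonyms more than for using one synonym twice, while the pseudoterm treats both cases identically.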

      1. LUCENE-6789.patch
        144 kB
        Robert Muir

          Activity

          mikemccand Michael McCandless added a comment -

          +1

          jpountz Adrien Grand added a comment -

          There are still some changes that refer to ClassicSimilarity as the default similarity instead of BM25Similarity, e.g.

          - * just tweaking the default implementation: {@link DefaultSimilarity}.
          + * just tweaking the default implementation: {@link ClassicSimilarity}.
          

          Should we change it?

          Also can we make SynonymQuery final?

          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          +1
          Lucene's BM25 competes very well against other search engines in terms of retrieval effectiveness and speed.
          Here are benchmarks from the IR reproducibility track of SIGIR 2015 (Santiago, CL): https://github.com/lintool/IR-Reproducibility/blob/master/Gov2.md

          jira-bot ASF subversion and git services added a comment -

          Commit 1703070 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1703070 ]

          LUCENE-6789: change IndexSearcher default similarity to BM25

          rcmuir Robert Muir added a comment -

          Thanks Adrien: I applied those fixes.


            People

            • Assignee:
              Unassigned
              Reporter:
              rcmuir Robert Muir