Lucene - Core
LUCENE-2187

improve Lucene's similarity algorithm defaults

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/query/scoring
    • Labels:
    • Lucene Fields: New

      Description

      First things first: I am not an IR guy. The goal of this issue is to make 'surgical' tweaks to Lucene's formula to bring its performance up to that of more modern algorithms such as BM25.

      In my opinion, the concept of having some 'flexible' scoring with good speed across the board is an interesting goal, but not practical in the short term.

      Instead, here I propose incorporating some work similar to lnu.ltc and friends, but slightly different. I noticed this seems to be in line with the paper published earlier about the TREC Million Query Track...

      Here is what I propose in pseudocode (overriding DefaultSimilarity):

        @Override
        public float tf(float freq) {
          // sublinear tf: dampen raw term frequency with a logarithm
          return 1 + (float) Math.log(freq);
        }

        @Override
        public float lengthNorm(String fieldName, int numTerms) {
          // pivoted length normalization: slope blends the field length with the pivot
          return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
        }
      

      Where slope is a constant (I used 0.25 for all relevance evaluations; the goal is to have a better default) and pivot is the average field length. Obviously we shouldn't make the user provide this, but instead have the system provide it.
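As a rough illustration (not the actual patch), here is a self-contained Java sketch of the two formulas above; the slope of 0.25 matches the value used in the evaluations here, while the pivot of 10.0 is just a hypothetical average field length chosen for the example:

```java
public class PivotedNormSketch {
    static final float SLOPE = 0.25f;  // constant from the proposal above
    static final float PIVOT = 10.0f;  // hypothetical average field length

    // sublinear tf: 1 + ln(freq)
    static float tf(float freq) {
        return 1 + (float) Math.log(freq);
    }

    // pivoted length normalization: 1 / ((1 - slope) * pivot + slope * numTerms)
    static float lengthNorm(int numTerms) {
        return (float) (1 / ((1 - SLOPE) * PIVOT + SLOPE * numTerms));
    }

    public static void main(String[] args) {
        // a field exactly at the pivot gets norm 1/pivot, as if no pivoting were applied
        System.out.println(lengthNorm(10)); // 0.1
        // a field 4x the pivot is damped sublinearly (1/17.5, not 1/40)
        System.out.println(lengthNorm(40));
        // a single occurrence contributes a tf weight of exactly 1
        System.out.println(tf(1)); // 1.0
    }
}
```

Note how the slope interpolates between a constant norm (slope = 0) and the plain 1/numTerms norm (slope = 1).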

      These two pieces do not improve Lucene much independently, but together they are competitive with BM25 scoring on the test collections I have run so far.

      The idea here is that this logarithmic tf normalization is independent of the tf / mean tf component you see in some of these algorithms. In fact, I implemented lnu.ltc with cosine pivoted length normalization and the log(tf)/log(mean tf) weighting, and it did not fare as well as this method. This approach is also simpler: we do not need to calculate the mean tf at all.
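To make the contrast concrete, here is a small comparison (example numbers only, not from the patch) of the simple 1 + ln(tf) weighting proposed here against an lnu-style (1 + ln(tf)) / (1 + ln(mean tf)) weighting, which additionally requires the per-document mean term frequency:

```java
import java.util.Arrays;

public class TfWeightingSketch {
    // the simple sublinear tf proposed above: no per-document statistics needed
    static double simpleTf(int freq) {
        return 1 + Math.log(freq);
    }

    // lnu-style tf: needs the mean term frequency of the document as well
    static double lnuTf(int freq, double meanTf) {
        return (1 + Math.log(freq)) / (1 + Math.log(meanTf));
    }

    public static void main(String[] args) {
        // hypothetical document: the frequencies of its unique terms
        int[] freqs = {1, 1, 2, 4};
        double meanTf = Arrays.stream(freqs).average().getAsDouble(); // 2.0

        for (int f : freqs) {
            System.out.printf("tf=%d simple=%.3f lnu=%.3f%n",
                    f, simpleTf(f), lnuTf(f, meanTf));
        }
    }
}
```

The extra meanTf argument is exactly the per-document statistic the simpler scheme avoids computing.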

      The BM25-like "binary" pivot here works better on the test collections I have run, but of course only with the tf modification.

      I am uploading a document with results from 3 test collections (Persian, Hindi, and Indonesian). I will test at least 3 more languages (yes, including English) across more collections and upload those results as well, but I need to process these corpora to run the tests with the benchmark package, so this will take some time (maybe weeks).

      So, please rip it apart with scoring theory etc., but keep in mind that 2 of these 3 test collections are in the OpenRelevance SVN. If you think you have a great idea, don't hesitate to test it and upload results; that is what it is for.

      Also keep in mind, again, that I am not a scoring or IR guy; the only thing I can really bring to the table here is the willingness to do a lot of relevance testing!

      1. LUCENE-2187.patch
        5 kB
        Robert Muir
      2. scoring.pdf
        148 kB
        Robert Muir
      3. scoring.pdf
        148 kB
        Robert Muir
      4. scoring.pdf
        125 kB
        Robert Muir


          Activity

          Robert Muir added a comment -

          Document with some simple results from the 3 collections I tested thus far.

          I chose to display simple graphs with descriptions of the collections and some of their peculiarities.

          If you want submission.txt dumps or verbose output from trec_eval, I can do that too, but I think it's less useful to start with.
          Robert Muir added a comment -

          Attaching an updated document with results for a 4th test collection, on English. For this one, BM25 did not fare so well.

          For the lazy, here are the MAP values:

          StandardAnalyzer
          Default Scoring: 0.3837
          BM25 Scoring: 0.3580
          Improved Scoring: 0.3994

          StandardAnalyzer + Porter
          Default Scoring: 0.4333
          BM25 Scoring: 0.4131
          Improved Scoring: 0.4515

          StandardAnalyzer + Porter + MoreLikeThis (top 5 docs)
          Default Scoring: 0.5234
          BM25 Scoring: 0.5087
          Improved Scoring: 0.5474

          Note that 0.5572 was the highest performing MAP on this corpus (Microsoft Research) in FIRE 2008: http://www.isical.ac.in/~fire/paper/Udupa-mls-fire2008.pdf

          Robert Muir added a comment -

          Sorry; this corrects some transposed axis labels and some grammatical mistakes.

          Robert Muir added a comment -

          Attached is a patch with the Similarity implementation. Of course, you have to manually supply the pivot value (average document length), for now.
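Until the pivot is computed automatically, it can be derived from the per-document field lengths available at index time; a minimal sketch (hypothetical helper, not part of the patch):

```java
public class PivotFromLengths {
    // average field length = total terms across documents / number of documents
    static float averageFieldLength(int[] fieldLengths) {
        long total = 0;
        for (int len : fieldLengths) {
            total += len;
        }
        return (float) total / fieldLengths.length;
    }

    public static void main(String[] args) {
        int[] lengths = {8, 12, 10, 30}; // hypothetical per-document field lengths
        System.out.println(averageFieldLength(lengths)); // (8+12+10+30)/4 = 15.0
    }
}
```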

          Tom Burton-West added a comment -

          Hi Robert,

          Is this implementation made moot by the new GSoC work, or would it still be worth testing this as well as BM25, DFR, and INF?

          I can't seem to find a link to the ORP collections. Can you point me to it?
          (I plan to test with our long docs, but thought I would try out some of the ORP collections as well)

          Tom

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.


            People

            • Assignee: Unassigned
            • Reporter: Robert Muir
            • Votes: 1
            • Watchers: 6