[LUCENE-2187] improve lucene's similarity algorithm defaults - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 4.9, 6.0
Component/s: core/query/scoring
Labels:
- dead

Lucene Fields:

New

Description

First things first: I am not an IR guy. The goal of this issue is to make 'surgical' tweaks to Lucene's scoring formula to bring its relevance performance up to that of more modern algorithms such as BM25.

In my opinion, the concept of having some 'flexible' scoring with good speed across the board is an interesting goal, but not practical in the short term.

Instead, I propose incorporating some work similar to lnu.ltc and friends, but slightly different. I noticed this seems to be in line with the paper published earlier about the TREC Million Query Track...

Here is what I propose in pseudocode (overriding DefaultSimilarity):

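  // slope is a constant; pivot is the average field length (to be supplied by the system)
  private float slope = 0.25f;
  private float pivot;

  @Override
  public float tf(float freq) {
    // 1 + ln(freq); guard freq == 0, where Math.log returns -Infinity
    return freq > 0 ? 1 + (float) Math.log(freq) : 0;
  }

  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    // pivoted length normalization
    return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
  }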

Where slope is a constant (I used 0.25 for all relevance evaluations: the goal is to have a better default), and pivot is the average field length. Obviously we shouldn't make the user provide the pivot; the system should compute it.
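To make the two formulas concrete, here is a minimal standalone sketch that computes the pivot as the average field length and applies both functions. The class and method names are hypothetical and are not Lucene API; the only values taken from this issue are the formulas and the 0.25 slope.

```java
// Illustrative sketch only: class and method names are hypothetical, not Lucene API.
final class PivotedNormSketch {
    // Slope from the issue description (0.25 was used for all evaluations).
    static final float SLOPE = 0.25f;

    // Pivot = average field length, given the number of terms in each document's field.
    static float averageFieldLength(int[] fieldLengths) {
        if (fieldLengths.length == 0) {
            return 0f;
        }
        long total = 0;
        for (int len : fieldLengths) {
            total += len;
        }
        return (float) total / fieldLengths.length;
    }

    // Logarithmic tf: 1 + ln(freq), and 0 for a term that does not occur.
    static float tf(float freq) {
        return freq > 0 ? 1 + (float) Math.log(freq) : 0f;
    }

    // Pivoted length norm: 1 / ((1 - slope) * pivot + slope * numTerms).
    static float lengthNorm(float pivot, int numTerms) {
        return (float) (1 / ((1 - SLOPE) * pivot + SLOPE * numTerms));
    }
}
```

With field lengths of 50, 100, and 150 terms the pivot is 100. A field of exactly pivot length then gets a norm of 1/100, while a field twice as long gets 1/125, so longer fields are penalized more gently than under a plain 1/numTerms norm.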

These two pieces do not improve Lucene much independently, but together they are competitive with BM25 scoring on the test collections I have run so far.

The idea here is that this logarithmic tf normalization is independent of the tf / mean TF normalization that you see in some of these algorithms. In fact, I implemented lnu.ltc with cosine pivoted length normalization and the log(tf)/log(mean TF) stuff, and it did not fare as well as this method. This method is also simpler: we do not need to calculate the mean TF at all.
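For comparison, here is a rough sketch of the two tf normalizations side by side. The mean-TF variant below assumes the common SMART "L" form, (1 + ln tf) / (1 + ln meanTf), which may differ in detail from what was actually tested for this issue. All names are hypothetical and not Lucene API.

```java
// Illustrative sketch only: names are hypothetical, not Lucene API.
final class TfNormalizationSketch {
    // Plain logarithmic tf from the proposal: needs only the term's own frequency.
    static double logTf(double freq) {
        return freq > 0 ? 1 + Math.log(freq) : 0;
    }

    // Mean term frequency of one document: total occurrences / number of unique terms.
    static double meanTf(int[] termFreqs) {
        if (termFreqs.length == 0) {
            return 0;
        }
        long total = 0;
        for (int f : termFreqs) {
            total += f;
        }
        return (double) total / termFreqs.length;
    }

    // Assumed SMART "L"-style variant: (1 + ln tf) / (1 + ln meanTf).
    // For a real document every frequency is at least 1, so meanTf >= 1.
    static double meanNormalizedTf(double freq, double meanTf) {
        if (freq <= 0 || meanTf <= 0) {
            return 0;
        }
        return (1 + Math.log(freq)) / (1 + Math.log(meanTf));
    }
}
```

The difference in bookkeeping is the point: logTf needs nothing beyond the term frequency, while meanNormalizedTf needs a per-document statistic (the mean TF) to be computed and stored.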

The BM25-like "binary" pivot here works better on the test collections I have run, but of course only with the tf modification.
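One way to read "BM25-like" here (an interpretation, not something this description spells out): dividing the denominator of the pivoted norm by the pivot gives the same length factor that appears in the BM25 denominator, 1 - b + b * (dl / avgdl), with slope in the role of b. The sketch below checks that relationship numerically; the names are hypothetical and not Lucene API.

```java
// Illustrative sketch only: names are hypothetical, not Lucene API.
final class Bm25PivotComparison {
    // Proposed pivoted norm: 1 / ((1 - slope) * pivot + slope * numTerms).
    static double pivotedNorm(double slope, double pivot, double numTerms) {
        return 1.0 / ((1 - slope) * pivot + slope * numTerms);
    }

    // Length factor from the BM25 denominator: 1 - b + b * (dl / avgdl).
    static double bm25LengthFactor(double b, double dl, double avgdl) {
        return 1 - b + b * (dl / avgdl);
    }
}
```

For any field length, pivotedNorm(slope, pivot, n) * pivot equals 1 / bm25LengthFactor(slope, n, pivot). Because the pivot is constant across documents, the two differ only by a constant factor, which does not change how documents are ranked relative to each other.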

I am uploading a document with results from 3 test collections (Persian, Hindi, and Indonesian). I will test at least 3 more languages (yes, including English) across more collections and upload those results as well, but I need to process these corpora to run the tests with the benchmark package, so this will take some time (maybe weeks).

So, please rip it apart with scoring theory etc., but keep in mind that 2 of these 3 test collections are in the openrelevance svn, so if you think you have a great idea, don't hesitate to test it and upload results; this is what it is for.

Also, keep in mind again that I am not a scoring or IR guy; the only thing I can really bring to the table here is the willingness to do a lot of relevance testing!

Attachments

- LUCENE-2187.patch, 05/Jan/10 03:43, 5 kB, Robert Muir
- scoring.pdf, 04/Jan/10 21:37, 148 kB, Robert Muir
- scoring.pdf, 04/Jan/10 21:28, 148 kB, Robert Muir
- scoring.pdf, 02/Jan/10 20:17, 125 kB, Robert Muir

Issue Links

is related to

LUCENE-2186 First cut at column-stride fields (index values storage)

Reopened

People

Assignee: Unassigned
Reporter: Robert Muir
Votes: 1
Watchers: 7

Dates

Created: 02/Jan/10 20:15
Updated: 28/Aug/22 12:18