First things first: I am not an IR guy. The goal of this issue is to make 'surgical' tweaks to lucene's formula to bring its performance up to that of more modern algorithms such as BM25.
In my opinion, the concept of having some 'flexible' scoring with good speed across the board is an interesting goal, but not practical in the short term.
Instead here I propose incorporating some work similar to lnu.ltc and friends, but slightly different. I noticed this seems to be in line with that paper published before about the trec million queries track...
Here is what I propose in pseudocode (overriding DefaultSimilarity):
Where slope is a constant (I used 0.25 for all relevance evaluations: the goal is to have a better default), and pivot is the average field length. Obviously we shouldnt make the user provide this but instead have the system provide it.
These two pieces do not improve lucene much independently, but together they are competitive with BM25 scoring with the test collections I have run so far.
The idea here is that this logarithmic tf normalization is independent of the tf / mean TF that you see in some of these algorithms, in fact I implemented lnu.ltc with cosine pivoted length normalization and log(tf)/log(mean TF) stuff and it did not fare as well as this method, and this is simpler, we do not need to calculate this mean TF at all.
The BM25-like "binary" pivot here works better on the test collections I have run, but of course only with the tf modification.
I am uploading a document with results from 3 test collections (Persian, Hindi, and Indonesian). I will test at least 3 more languages... yes including English... across more collections and upload those results also, but i need to process these corpora to run the tests with the benchmark package, so this will take some time (maybe weeks)
so, please rip it apart with scoring theory etc, but keep in mind 2 of these 3 test collections are in the openrelevance svn, so if you think you have a great idea, don't hesitate to test it and upload results, this is what it is for.
also keep in mind again I am not a scoring or IR guy, the only thing i can really bring to the table here is the willingness to do a lot of relevance testing!