
Key: LUCENE-954
Type: Improvement
Status: Closed
Resolution: Won't Fix
Priority: Major
Assignee: Otis Gospodnetic
Reporter: Christian Kohlschütter
Votes: 0
Watchers: 0
Lucene - Java

Toggle score normalization in Hits

Created: 09/Jul/07 04:52 PM   Updated: 05/Aug/08 11:33 PM
Component/s: Search
Affects Version/s: 2.2, 2.3, 2.3.1, 2.4
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
hits-scoreNorm.patch (3 kB), 2007-07-09 04:53 PM, Christian Kohlschütter, licensed for inclusion in ASF works
LUCENE-954.patch (7 kB), 2008-02-24 06:41 PM, Christian Kohlschütter, licensed for inclusion in ASF works
Environment: any

Lucene Fields: Patch Available, New
Resolution Date: 05/Aug/08 11:33 PM


Description
The current implementation of the "Hits" class sometimes performs score normalization.
In particular, whenever the top-ranked score is bigger than 1.0, it is normalized to a maximum of 1.0.

In this case, Hits may return different score results than TopDocs-based methods.
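
The behaviour described above can be sketched in plain Java, with no Lucene dependency. This is an illustration only, not the actual Hits source: the class name `HitsNormalizationSketch` and its methods are hypothetical, and the only rule taken from the description is "rescale to a maximum of 1.0 when the top score exceeds 1.0".

```java
// Illustrative sketch only; names are hypothetical and this is not Lucene code.
final class HitsNormalizationSketch {

    /** Factor applied to raw scores: 1/max when max > 1.0, otherwise 1.0. */
    static float scoreNorm(float maxScore) {
        return maxScore > 1.0f ? 1.0f / maxScore : 1.0f;
    }

    /** Rescales every raw score by the factor derived from the highest score. */
    static float[] normalize(float[] rawScores) {
        float max = 0.0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float norm = scoreNorm(max);
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Top score above 1.0: every score is rescaled.
        System.out.println(java.util.Arrays.toString(normalize(new float[] {2.0f, 1.0f, 0.5f})));
        // Top score at or below 1.0: scores pass through unchanged.
        System.out.println(java.util.Arrays.toString(normalize(new float[] {0.8f, 0.4f})));
    }
}
```

The second call returns its input unchanged, so whether a result list carries raw or rescaled scores depends on the query and the index, which is why the normalized values can differ from what a TopDocs-based search reports.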

In my scenario (a federated search system), Hits delivered just plain wrong results.
I was merging results from several sources, all having homogeneous statistics (similar to MultiSearcher, but over the Internet using HTTP/XML-based protocols).
Sometimes, some of the sources had a top-score greater than 1, so I ended up with garbled results.

I suggest adding a switch to enable/disable this score normalization at runtime.
My patch (attached) has an additional performance benefit, since score normalization now occurs only when Hits#score() is called, not when creating the Hits result list. Whenever scores are not required, you save one multiplication per retrieved hit (i.e., at least 100 multiplications with the current implementation of Hits).
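
The idea of a runtime switch combined with normalization deferred until a score is requested might look roughly like the following. This is a sketch under stated assumptions, not the attached patch: `LazyScoreSketch`, `setNormalize` and the "on by default" setting are hypothetical names and choices made for illustration.

```java
// Illustrative sketch only; this is not the code from the attached patch.
final class LazyScoreSketch {
    private final float[] rawScores;  // raw scores, highest first
    private final float scoreNorm;    // computed once, applied only on demand
    private boolean normalize = true; // assumption: on by default, to match the old behaviour

    LazyScoreSketch(float[] rawScores) {
        this.rawScores = rawScores.clone();
        float max = rawScores.length > 0 ? rawScores[0] : 0.0f;
        this.scoreNorm = max > 1.0f ? 1.0f / max : 1.0f;
    }

    /** Hypothetical switch: enables or disables normalization at runtime. */
    void setNormalize(boolean normalize) {
        this.normalize = normalize;
    }

    /** The multiplication happens here, per call, not when the list is built. */
    float score(int n) {
        return normalize ? rawScores[n] * scoreNorm : rawScores[n];
    }
}
```

A caller that only needs document ids never calls `score(int)` and therefore never pays for the multiplication; a caller that needs raw scores calls `setNormalize(false)` first.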



Comments
Christian Kohlschütter added a comment - 09/Jul/07 04:53 PM
Adds a switch to enable/disable Hits-based score normalization.

Grant Ingersoll added a comment - 28/Sep/07 03:33 PM
change the version

Otis Gospodnetic added a comment - 17/Feb/08 08:28 AM
I don't see any harm in adding this, though I don't understand why the top-score was greater than 1 (original description above mentions this) in the first place, since scores in Hits are normalized and thus should always be less than 1.

Yonik Seeley added a comment - 17/Feb/08 01:37 PM
Normalization is only applied to the queryWeight part of the score (the part the same for all documents), but not to the fieldWeight. idf and norms can both be > 1.

Christian Kohlschütter added a comment - 22/Feb/08 10:32 AM
You are right, Yonik.

Hits currently tries to "hide" this by normalizing the scores to a maximum of 1, simply by dividing the "raw" scores by the maximum score returned.

This is why the scores from Hits are currently not comparable to each other. The suggested patch resolves this problem.


Grant Ingersoll added a comment - 22/Feb/08 11:30 AM

"This is why the scores from Hits are currently not comparable to each other"

Do you mean across queries or within queries? Even if you have raw scores, they still won't be comparable across queries, or at least that is my understanding of the literature. In your original case of federated search across several sources, each with its own stats, it is not well understood what scores mean. I'm not saying it can't be done (it really is the only way to do federated search), just that I'm not sure one should read too much into the scores. Of course, this is more of a discussion for the user list than a JIRA issue, so I'd be happy to discuss more there and hear other thoughts. It has been a while since I have read anything on it.

That also isn't to say that your patch isn't worthwhile, just wondering whether the change is actually meaningful for your use case.


Christian Kohlschütter added a comment - 22/Feb/08 12:09 PM
Grant,

sorry I was perhaps not too clear about it.

The distribution of scores of one Hits instance is currently not comparable to the distribution of scores of another Hits object, even if the underlying statistics are comparable/compatible/identical. This is because the values are normalized to a maximum of 1.0 whenever the top score exceeds 1.0.

As I said, my Federated Search system provides homogeneous statistics (but not via MultiSearcher). In fact, it does not use MultiSearcher for this, but a variant of the SRU/SRW/XCQL protocols ("SRX/FS"), where all communication is done via HTTP and XML. This includes the exchange of Term/DF statistics. At the end, the system makes several distributed Indexes appear as a single (read: federated) index. In order to merge the results from each index, Hits is used.

In the simplest case, the results from every Hits object (one per source) are simply merged by score in descending order. With the current implementation of Lucene Hits, these scores are not comparable across instances. With the patch, they are (at least when score normalization is turned off).
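
The effect on a merge by descending score can be illustrated with a small, self-contained sketch. Everything in it is hypothetical (the class `FederatedMergeSketch`, the `Result` type, the document ids and the score values); it assumes only the per-source rescaling rule described in this issue and is not code from the federated search system or from Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; all names and values are made up for the example.
final class FederatedMergeSketch {

    static final class Result {
        final String id;
        final float score;
        Result(String id, float score) { this.id = id; this.score = score; }
    }

    /** Rescales one source's scores, but only if that source's top score exceeds 1.0. */
    static List<Result> normalizePerSource(List<Result> source) {
        float max = 0.0f;
        for (Result r : source) {
            max = Math.max(max, r.score);
        }
        float norm = max > 1.0f ? 1.0f / max : 1.0f;
        List<Result> out = new ArrayList<Result>();
        for (Result r : source) {
            out.add(new Result(r.id, r.score * norm));
        }
        return out;
    }

    /** Merges all sources into one list of ids, ordered by score, highest first. */
    static List<String> mergeByScore(List<List<Result>> sources) {
        List<Result> all = new ArrayList<Result>();
        for (List<Result> s : sources) {
            all.addAll(s);
        }
        Collections.sort(all, new Comparator<Result>() {
            public int compare(Result x, Result y) {
                return Float.compare(y.score, x.score);
            }
        });
        List<String> ids = new ArrayList<String>();
        for (Result r : all) {
            ids.add(r.id);
        }
        return ids;
    }
}
```

With source A holding raw scores 4.0 and 2.0 and source B holding 0.9 and 0.6, merging the raw scores ranks both A documents first. After per-source normalization, A's scores become 1.0 and 0.5 while B's stay unchanged, so A's second document drops below both of B's.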

If you need more information about the Federated Search system, we can indeed move the discussion to the mailing list. However, I think the problem is not really specific to my needs. Even if you have two Hits instances locally, you might want to be able to compare the scores (or merge the results) from Hits instance A to those from Hits instance B (in particular, when they are from the same index). This is also not possible right now.


Doug Cutting added a comment - 22/Feb/08 07:37 PM
Disabling score normalization in Hits seems like a reasonable feature to me. +1

The indentation in this patch uses tabs. -1


Christian Kohlschütter added a comment - 24/Feb/08 06:41 PM
Reworked patch, now against SVN Trunk (also works with 2.3), and without tabs.
The patch file also includes a new testcase which demonstrates the new feature.

Christian Kohlschütter added a comment - 25/Feb/08 02:11 PM
changed affected versions

Nadav Har'El added a comment - 16/Mar/08 09:19 PM
I hate to rain on the parade, but maybe instead of making small modifications to the way Hits works, it's time to deprecate it?

Hits has numerous flaws compared to the alternative interface (Searcher.search(Query, HitCollector), with TopDocCollector). It tries to "guess" in advance the number of results it should calculate (usually calculating too many, or too few and having to run the search again). It does bizarre normalization of the score (as this patch points out). It is harder to extend than the HitCollector interface (for an example, see the recently checked-in timed hit collector, which replaced yet another suggested improvement to the Hits interface).

So I say - it's time to deprecate the Hits search(Query) method, to change the tutorials to recommend TopDocCollector instead, and to stop trying to improve Hits.


Michael Busch added a comment - 16/Mar/08 09:45 PM

"So I say - it's time to deprecate the Hits search(Query) method, to change the tutorials to recommend TopDocCollector instead, and to stop trying to improve Hits."

+1


Hoss Man added a comment - 16/Mar/08 11:28 PM

"I hate to rain on the parade, but maybe instead of making small modifications to the way Hits works, it's time to deprecate it?"

I agree with your sentiment, but it's somewhat orthogonal to this issue.

If someone opens a Jira issue to deprecate the Hits class, and attaches a patch that does so and replaces examples of its usage in the demo and tutorial, I'll certainly vote for it – but until then, if people want to try and improve Hits, there's little reason not to do so.


Christian Kohlschütter added a comment - 17/Mar/08 09:45 AM
I agree with Hoss. Please file a new issue if you want to see Hits (and consequently also Hit/HitIterator) being deprecated. I do not see any reason for this, though.

This patch is meant for helping Lucene users who currently use the Hits class and particularly have problems with the built-in score normalization, and not with its performance.


Otis Gospodnetic added a comment - 17/May/08 04:06 AM
Unit test passes.
Will commit next week.

Michael Busch added a comment - 19/May/08 09:06 PM

"If someone opens a Jira issue to deprecate the Hits class, and attaches a patch that does so and replaces examples of its usage in the demo and tutorial, I'll certainly vote for it."

Done with LUCENE-1290.
I'm waiting for your vote, Hoss.


Otis Gospodnetic added a comment - 28/May/08 04:58 AM
I suppose there is now suddenly no need to work on Hits. I'll resolve this as Won't Fix in a few days, unless somebody has some more thoughts on this.

Mark Miller added a comment - 05/Aug/08 09:45 PM
Looks like this guy is ready to be resolved.

Michael Busch added a comment - 05/Aug/08 11:33 PM
Hits is deprecated.