Christian Kohlschütter added a comment - 09/Jul/07 04:53 PM
Adds a switch to enable/disable Hits-based score normalization.
I don't see any harm in adding this, though I don't understand why the top-score was greater than 1 (original description above mentions this) in the first place, since scores in Hits are normalized and thus should always be less than 1.
Normalization is only applied to the queryWeight part of the score (the part that is the same for all documents), but not to the fieldWeight. idf and norms can both be > 1.
You are right, Yonik.
Hits currently tries to "hide" this by normalizing the scores to a maximum of 1, simply by dividing the "raw" scores by the maximum score returned. This is why the scores from Hits are currently not comparable to each other. The suggested patch resolves this problem.
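To make the effect concrete, here is a minimal sketch in plain Java of the normalization as described in the comment above: each raw score is divided by the maximum score of its own result set. It is not the actual Hits source, which may differ in detail; the class name `HitsNormalizationSketch`, the method `scores`, and the sample values are hypothetical and chosen only for illustration.

```java
// Illustrative sketch only; this is not Lucene code.
class HitsNormalizationSketch {

    /**
     * Returns a copy of the raw scores. When normalize is true, every score
     * is divided by the largest score in the same result set.
     */
    static float[] scores(float[] raw, boolean normalize) {
        float[] out = raw.clone();
        if (!normalize || raw.length == 0) {
            return out;
        }
        float max = 0f;
        for (float s : raw) {
            max = Math.max(max, s);
        }
        if (max <= 0f) {
            return out; // nothing sensible to divide by
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = raw[i] / max;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two hypothetical result sets with made-up raw scores.
        float[] rawA = {4.0f, 2.0f};
        float[] rawB = {2.0f, 1.0f};

        float[] normA = scores(rawA, true);
        float[] normB = scores(rawB, true);

        // The second hit of A and the first hit of B share the same raw
        // score, but after per-set normalization they no longer match.
        System.out.println("raw:        " + rawA[1] + " vs " + rawB[0]);
        System.out.println("normalized: " + normA[1] + " vs " + normB[0]);
    }
}
```

With normalization switched off (the second argument set to `false`), the raw values are passed through unchanged, which is the behavior the suggested switch is meant to expose.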
Do you mean across queries or within queries? Even if you have raw scores, they still won't be comparable across queries, or at least that is my understanding of the literature. In your original case of federated search across several sources, each with its own stats, it is not well understood what scores mean. Not saying it can't be done, it really is the only way to do federated search, just not sure one can try to read too much into the scores. Of course, this is more of a discussion for the user list than a JIRA issue, so I'd be happy to discuss more there and hear other thoughts. It has been a while since I have read anything on it. That also isn't to say that your patch isn't worthwhile, just wondering whether the change is actually meaningful for your use case.
Grant,
sorry, I was perhaps not too clear about it. The distribution of scores of one Hits instance is currently not comparable to the distribution of scores of another Hits object, even if the underlying statistics are comparable/compatible/identical. This is because the values are always normalized to a maximum of 1.0.
As I said, my Federated Search system provides homogeneous statistics (but not via MultiSearcher). In fact, it does not use MultiSearcher for this, but a variant of the SRU/SRW/XCQL protocols ("SRX/FS"), where all communication is done via HTTP and XML. This includes the exchange of Term/DF statistics. In the end, the system makes several distributed indexes appear as a single (read: federated) index.
In order to merge the results from each index, Hits is used. In the simplest case, the results from every Hits object (one per source) are simply merged by score in descending order. With the current implementation of Lucene Hits, these scores are not comparable across instances. With the patch, they are (at least when score normalization is turned off).
If you need more information about the Federated Search system, we can indeed move the discussion to the mailing list. However, I think the problem is not really specific to my needs. Even if you have two Hits instances locally, you might want to be able to compare the scores (or merge the results) from Hits instance A to those from Hits instance B (in particular, when they are from the same index). This is also not possible right now.
Disabling score normalization in Hits seems like a reasonable feature to me. +1
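The "simplest case" merge described above (results from each source merged by score in descending order) could look roughly like the following plain-Java sketch. It assumes the scores are raw, unnormalized values computed from shared statistics, as the comment requires. The names `FederatedMergeSketch`, `ScoredHit`, and `mergeByScore` are hypothetical and are not part of Lucene or of the SRX/FS system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; the types here are stand-ins, not Lucene classes.
class FederatedMergeSketch {

    /** One search result together with the source it came from. */
    static final class ScoredHit {
        final String source;
        final String docId;
        final float score;

        ScoredHit(String source, String docId, float score) {
            this.source = source;
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Concatenates the per-source result lists and sorts them by score,
     * highest first. This is only meaningful if the scores are comparable,
     * i.e. raw scores computed from the same term/DF statistics.
     */
    static List<ScoredHit> mergeByScore(List<List<ScoredHit>> perSource) {
        List<ScoredHit> merged = new ArrayList<>();
        for (List<ScoredHit> hits : perSource) {
            merged.addAll(hits);
        }
        merged.sort(Comparator.comparingDouble((ScoredHit h) -> h.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        // Made-up raw scores from two hypothetical sources.
        List<ScoredHit> sourceA = new ArrayList<>();
        sourceA.add(new ScoredHit("A", "a1", 4.0f));
        sourceA.add(new ScoredHit("A", "a2", 1.5f));

        List<ScoredHit> sourceB = new ArrayList<>();
        sourceB.add(new ScoredHit("B", "b1", 2.0f));
        sourceB.add(new ScoredHit("B", "b2", 1.0f));

        List<List<ScoredHit>> all = new ArrayList<>();
        all.add(sourceA);
        all.add(sourceB);

        for (ScoredHit h : mergeByScore(all)) {
            System.out.println(h.source + ":" + h.docId + " " + h.score);
        }
    }
}
```

If each list had first been normalized to a maximum of 1.0, the top hit of every source would tie at 1.0 regardless of how well it matched, which is why the merge depends on normalization being turned off.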
The indentation in this patch uses tabs. -1
Reworked patch, now against SVN trunk (also works with 2.3), and without tabs.
The patch file also includes a new test case which demonstrates the new feature.
Changed affected versions.
I hate to rain on the parade, but maybe instead of making small modifications to the way Hits works, it's time to deprecate it?
Hits has numerous flaws compared to the alternative interface (Searcher.search(Query, HitCollector), with TopDocCollector). It tries to "guess" in advance the number of results it should calculate (usually calculating too many, or too few and having to run the search again). It does bizarre normalization of the score (as this patch points out). It is harder to extend than the HitCollector interface (for an example, see the recently checked-in timed hit collector, which replaced yet another suggested improvement to the Hits interface). So I say: it's time to deprecate the Hits search(Query) method, to change the tutorials to recommend TopDocCollector instead, and to stop trying to improve Hits.
+1
I agree with your sentiment, but it's somewhat orthogonal to this issue. If someone opens a Jira issue to deprecate the Hits class, and attaches a patch that does so and replaces examples of its usage in the demo and tutorial, I'll certainly vote for it – but until then, if people want to try and improve Hits, there's little reason not to do so.
I agree with Hoss. Please file a new issue if you want to see Hits (and consequently also Hit/HitIterator) being deprecated. I do not see any reason for this, though.
This patch is meant to help Lucene users who currently use the Hits class and have problems specifically with its built-in score normalization, not with its performance.
Unit test passes.
Will commit next week.
Done with I suppose there is now suddenly no need to work on Hits. I'll resolve this as Won't Fix in a few days, unless somebody has some more thoughts on this.
Looks like this guy is ready to be resolved.