|
[
Permlink
| « Hide
]
Doug Cutting added a comment - 19/May/08 09:32 PM
FWIW, the Hits API was originally designed to support desktop applications, with a scrollable pane of hits. I wonder if anyone ever actually used it that way, and, if so, whether it worked well...
That's how it was used in the SeartchFiles demo and it worked quite well for this small tool. But I believe in real-world apps Hits has quite a few drawbacks. For example, it always As the new demo shows, it is quite easy to implement paging without Hits, just by using FYI: when applying the patch, i got two failed hunks in src/test/org/apache/lucene/search/TestCustomSearcherSort.java – but i didn't dig into why, possibly due to eol-style not being set on the files or perhaps because of tabs (see below) The only files I looked at closely where the Hits class, the demo code, and the docs... We should probably make the deprecation of Hits more obvious in it's class level javadocs ... i would even go so far as to suggest including sample code showing the difference for people who are use to using hits. The new SearchFiles.java demo app seems straight forward and easy to understand, although we should probably add some more javadocs to doPagingSearch and doStreamingSearch to explain what we mean by those concepts. The new code in both of these methods (ie: lines using HitCollector and TopDocCollector) seem to be indented with tabs and not space characters ... so that should be normalized (i didn't check other files modified by the patch to see if they were likewise affected) The changes to demo2.xml seem straight forward enough ... although I notice Hits is still referenced in scoring.xml so that should be changed also. Thanks for the review, Hoss!
Your suggestions make sense. I'll update the patch. I'm actually feeling -1-ish about this. I don't think Hits are hurting those who are truly concerned about performance. Those who want performance have other API options. But Hits is so nice and simple, and that must be valuable to a large portion of Lucene users (think CD searches, site searches, desktop search apps, etc., not massive distributed searches and such).
Why can't we let Hits live? If we are concerned about its performance, we can easily javadoc and Wiki that. Otis,
what do you think are the advantages of Hits over the TopDocCollector/TopDocs approach? If you look at the demo in this patch then you'll notice that I actually didn't make many changes and that it's just as easy to do paging without Hits. I think almost everybody who wants to have a decent control over performance and memory consumption during hit collection should use a HitCollector. From the more or less regular mailing list discussions on performance issues with Hits I think one can safely conclude that the real problem is that everyone who tries to use Hits "for something else" loses time. Quite a while ago this also happened to me, and I still remember that enough to vote for deprecation.
New version of the patch:
Hoss, could you try if this patch applies cleanly now and all tests pass, please? Wow, I didn't imagine before that a patch that simply deprecates one class -1 from me for the current solution.
Deprecating Hits necessarily means deprecating HitIterator. With Hits/HitIterator we have two really simple ways to iterate over a long list of search results. The TopDocs/HitCollector-based approach is basically one level below Hits, and thus, Hits can clearly be regarded a convenience class then. It is not as flexible as HitCollector, but serves its purpose very well. What could make sense is to deprecate the Searcher#search() methods which return a Hits instance, to reduce API clutter. Hits could have a public constructor that takes a Searcher instance instead.
I think this is exactly a problem of Hits. If you use an HitIterator to iterate over let's say IMO Hits only makes sense if you want to use it for paging or, as Doug pointed out, for
What are in your opinion the advantages of using an Iterator interface instead of looping over Michael,
the current implementation of Hits certainly has its deficiencies, but represents a very simple way to retrieve documents from Lucene. As long as there is no real replacement, I simply do not a reason to deprecate it. A replacement could be an API which allows something like: for(Iterator<ScoreDoc> it = searcher.iterator(query); it.hasNext(); ) {
That would duplicate the search methods that use a HitCollector. The HitCollerctor and Iterator approach only supports forward displaying of results. In a typical Google-like Website, where the user can just jump to page X, display some results and can jump back to page Y and display results from there too, Hits works really good. The problem with Hits is, that it returns and caches whole "Documents". If it could just return ScoreDocs and would implement the Java Collection API "List" Interface (using AbstractList), that would be a good replacement for "navigateable" result sets.
Michael:
The HitCollector callback is called in index order (or in any other, non-deterministic order), whereas the results in Hits are sorted (by relevance or any given Sort order). Uwe: Christian:
For my project "panFMP", I created a List-based wrapper on top of current "Hits", which also uses some special assumptions on the index structure behind (which is used by panFMP): http://panfmp.svn.sourceforge.net/viewvc/panfmp/main/trunk/src/de/pangaea/metadataportal/search/SearchResultList.java?view=markup On my plan is to change this to use TopDocCollector and only cache the ScoreDoc instances (id and score). My example just shows, how AbstractList may be used. The problem with AbstractList is, that get() cannot throw Exceptions, but you get the Iterator for free. For better performance, your could also overwrite iterator() and use a sequential TopDocs-based approach for fetching results. With the Java Collections List you could also do this syntactical sugar int start=0,count=10;
Even if you want to show results 100-200, then you still need to sort at least 200 results, also Hits needs to, it doesn't do any magic. Then you can as well keep the ScoreDoc[] array to page back.
Hits uses TopDocCollector/TopDocs for sorting. The same can be achieved using those APIs, which Searcher offers, directly. Please take a look at the patch. I replaced a lot of usages of Hits in the test cases with the TopDocs APIs. In most cases this was a one line change. Of course I don't want to remove something that adds real value to Lucene. IMO, Hits does not. It is a class whose APIs we always discourage to use. Such a class does not belong in the core. The HitCollector (especially TopDocCollector) APIs are similar to use and offer more control, flexibility and insight.
So I think we should deprecate Hits (it will still be part then of 2.4 and 2.9 and removed in 3.0). In the meantime, if people really think that there should be a replacement for Hits in the future, we can develop a new helper class in contib that doesn't have the drawbacks that Hits has, and contributors like Uwe and Christian are very welcome to submit patches for that! However, I don't want to make a decision here that is in conflict with opinions of others, so maybe it might be a good idea to call a vote on java-dev about this issue? Hits does seem to be very good at leading new Lucene users into writing non-performant code and on that basis is probably worth deprecating.
However, in it's place I think we can do more to support the commonest search scenario users will have, which I suspect is the typical web-based search application, serving a page of "results xx to yy of z,zzz,zzz".. The HitPageCollector in the link below is more helpful in this respect than the raw TopDocCollector - it takes a "start" value in the constructor which represents the "xx" in the above. It could be extended to have read-only properties representing y and z too and helper methods e.g. isFirstPage, isLastPage. See http://www.mail-archive.com/java-user@lucene.apache.org/msg15344.html
So it seems that a fair amount of people agree here that Hits should be deprecated. Hits will still be part of Lucene 2.4 and 2.9. In the meantime, we can develop new helper classes, So in case of no negative votes, I will commit this patch before the weekend. Beefed up doPagingSearch() in the SearchFiles demo.
No It collects enough results for the first 5 result pages. If the user wants to skip beyond the 5th page, then the query is reexecuted to collect all hits, similar to Google. The user can now also page back or jump directly to a page number. I'm going to commit this patch now. i've been away from the party while all the exciting stuff was being discussed, but for my 2 cents...
As Michael said: Hits is deprecated, not deleted. clients will have plenty of time to change, and the javadocs for Hits provides a good explanation for how to use TopDocs instead – but if people want to implement an "Iterator" based alternative, we should probably do that in a new issue. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||