Details

    • Type: Task Task
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The Hits class has several drawbacks as pointed out in LUCENE-954.
      The other search APIs that use TopDocCollector and TopDocs should be used instead.

      This patch:

      • deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
        the Searcher.search( * ) methods which return a Hits Object.
      • removes all references to Hits from the core and uses TopDocs and ScoreDoc
        instead
      • Changes the demo SearchFiles: adds the two modes 'paging search' and 'streaming search',
        each of which demonstrating a different way of using the search APIs. The former
        uses TopDocs and a TopDocCollector, the latter a custom HitCollector implementation.
      • Updates the online tutorial that descibes the demo.

      All tests pass.

      1. lucene-1290.patch
        214 kB
        Michael Busch
      2. lucene-1290.patch
        218 kB
        Michael Busch
      3. lucene-1290.patch
        236 kB
        Michael Busch

        Activity

        Hide
        Doug Cutting added a comment -

        FWIW, the Hits API was originally designed to support desktop applications, with a scrollable pane of hits. I wonder if anyone ever actually used it that way, and, if so, whether it worked well...

        Show
        Doug Cutting added a comment - FWIW, the Hits API was originally designed to support desktop applications, with a scrollable pane of hits. I wonder if anyone ever actually used it that way, and, if so, whether it worked well...
        Hide
        Michael Busch added a comment -

        FWIW, the Hits API was originally designed to support desktop applications, with a scrollable pane of hits. I wonder if anyone ever actually used it that way, and, if so, whether it worked well...

        That's how it was used in the SeartchFiles demo and it worked quite well for this small tool.

        But I believe in real-world apps Hits has quite a few drawbacks. For example, it always
        fetches 50 results and buffers them. This number is not configurable which it should be
        for performance reasons. Also, I think very often users need to implement their own
        HitCollector anyways, e. g. for duplication detected or security, and can't use Hits then.

        As the new demo shows, it is quite easy to implement paging without Hits, just by using
        TopDocs and TopDocCollector.

        Show
        Michael Busch added a comment - FWIW, the Hits API was originally designed to support desktop applications, with a scrollable pane of hits. I wonder if anyone ever actually used it that way, and, if so, whether it worked well... That's how it was used in the SeartchFiles demo and it worked quite well for this small tool. But I believe in real-world apps Hits has quite a few drawbacks. For example, it always fetches 50 results and buffers them. This number is not configurable which it should be for performance reasons. Also, I think very often users need to implement their own HitCollector anyways, e. g. for duplication detected or security, and can't use Hits then. As the new demo shows, it is quite easy to implement paging without Hits, just by using TopDocs and TopDocCollector.
        Hide
        Hoss Man added a comment -

        FYI: when applying the patch, i got two failed hunks in src/test/org/apache/lucene/search/TestCustomSearcherSort.java – but i didn't dig into why, possibly due to eol-style not being set on the files or perhaps because of tabs (see below)

        The only files I looked at closely where the Hits class, the demo code, and the docs...

        We should probably make the deprecation of Hits more obvious in it's class level javadocs ... i would even go so far as to suggest including sample code showing the difference for people who are use to using hits.

        The new SearchFiles.java demo app seems straight forward and easy to understand, although we should probably add some more javadocs to doPagingSearch and doStreamingSearch to explain what we mean by those concepts. The new code in both of these methods (ie: lines using HitCollector and TopDocCollector) seem to be indented with tabs and not space characters ... so that should be normalized (i didn't check other files modified by the patch to see if they were likewise affected)

        The changes to demo2.xml seem straight forward enough ... although I notice Hits is still referenced in scoring.xml so that should be changed also.

        Show
        Hoss Man added a comment - FYI: when applying the patch, i got two failed hunks in src/test/org/apache/lucene/search/TestCustomSearcherSort.java – but i didn't dig into why, possibly due to eol-style not being set on the files or perhaps because of tabs (see below) The only files I looked at closely where the Hits class, the demo code, and the docs... We should probably make the deprecation of Hits more obvious in it's class level javadocs ... i would even go so far as to suggest including sample code showing the difference for people who are use to using hits. The new SearchFiles.java demo app seems straight forward and easy to understand, although we should probably add some more javadocs to doPagingSearch and doStreamingSearch to explain what we mean by those concepts. The new code in both of these methods (ie: lines using HitCollector and TopDocCollector) seem to be indented with tabs and not space characters ... so that should be normalized (i didn't check other files modified by the patch to see if they were likewise affected) The changes to demo2.xml seem straight forward enough ... although I notice Hits is still referenced in scoring.xml so that should be changed also.
        Hide
        Michael Busch added a comment -

        Thanks for the review, Hoss!

        Your suggestions make sense. I'll update the patch.

        Show
        Michael Busch added a comment - Thanks for the review, Hoss! Your suggestions make sense. I'll update the patch.
        Hide
        Otis Gospodnetic added a comment -

        I'm actually feeling -1-ish about this. I don't think Hits are hurting those who are truly concerned about performance. Those who want performance have other API options. But Hits is so nice and simple, and that must be valuable to a large portion of Lucene users (think CD searches, site searches, desktop search apps, etc., not massive distributed searches and such).

        Why can't we let Hits live? If we are concerned about its performance, we can easily javadoc and Wiki that.

        Show
        Otis Gospodnetic added a comment - I'm actually feeling -1-ish about this. I don't think Hits are hurting those who are truly concerned about performance. Those who want performance have other API options. But Hits is so nice and simple, and that must be valuable to a large portion of Lucene users (think CD searches, site searches, desktop search apps, etc., not massive distributed searches and such). Why can't we let Hits live? If we are concerned about its performance, we can easily javadoc and Wiki that.
        Hide
        Michael Busch added a comment -

        Otis,

        what do you think are the advantages of Hits over the TopDocCollector/TopDocs approach? If you look at the demo in this patch then you'll notice that I actually didn't make many changes and that it's just as easy to do paging without Hits.

        I think almost everybody who wants to have a decent control over performance and memory consumption during hit collection should use a HitCollector.
        Also, removing Hits will simplify the search methods and we'll have some APIs less to support .

        Show
        Michael Busch added a comment - Otis, what do you think are the advantages of Hits over the TopDocCollector/TopDocs approach? If you look at the demo in this patch then you'll notice that I actually didn't make many changes and that it's just as easy to do paging without Hits. I think almost everybody who wants to have a decent control over performance and memory consumption during hit collection should use a HitCollector. Also, removing Hits will simplify the search methods and we'll have some APIs less to support .
        Hide
        Paul Elschot added a comment -

        From the more or less regular mailing list discussions on performance issues with Hits I think one can safely conclude that the real problem is that everyone who tries to use Hits "for something else" loses time. Quite a while ago this also happened to me, and I still remember that enough to vote for deprecation.

        Show
        Paul Elschot added a comment - From the more or less regular mailing list discussions on performance issues with Hits I think one can safely conclude that the real problem is that everyone who tries to use Hits "for something else" loses time. Quite a while ago this also happened to me, and I still remember that enough to vote for deprecation.
        Hide
        Michael Busch added a comment -

        New version of the patch:

        • added TopDocCollector example to deprecated-section in the javadocs of Hits
        • added more comments to new demo code
        • updated scoring.html and removed references to Hits
        • got rid of tabs (patch only uses whitespaces now)

        Hoss, could you try if this patch applies cleanly now and all tests pass, please?
        After I committed eol-style=native to all files, changed the tabs to whitespaces
        and recreated the patch file patching and running the tests worked fine for me
        on linux.

        Wow, I didn't imagine before that a patch that simply deprecates one class
        would have more than 5000 lines!

        Show
        Michael Busch added a comment - New version of the patch: added TopDocCollector example to deprecated-section in the javadocs of Hits added more comments to new demo code updated scoring.html and removed references to Hits got rid of tabs (patch only uses whitespaces now) Hoss, could you try if this patch applies cleanly now and all tests pass, please? After I committed eol-style=native to all files, changed the tabs to whitespaces and recreated the patch file patching and running the tests worked fine for me on linux. Wow, I didn't imagine before that a patch that simply deprecates one class would have more than 5000 lines!
        Hide
        Christian Kohlschütter added a comment -

        -1 from me for the current solution.

        Deprecating Hits necessarily means deprecating HitIterator. With Hits/HitIterator we have two really simple ways to iterate over a long list of search results. The TopDocs/HitCollector-based approach is basically one level below Hits, and thus, Hits can clearly be regarded a convenience class then. It is not as flexible as HitCollector, but serves its purpose very well.

        What could make sense is to deprecate the Searcher#search() methods which return a Hits instance, to reduce API clutter. Hits could have a public constructor that takes a Searcher instance instead.

        Show
        Christian Kohlschütter added a comment - -1 from me for the current solution. Deprecating Hits necessarily means deprecating HitIterator. With Hits/HitIterator we have two really simple ways to iterate over a long list of search results. The TopDocs/HitCollector-based approach is basically one level below Hits, and thus, Hits can clearly be regarded a convenience class then. It is not as flexible as HitCollector, but serves its purpose very well. What could make sense is to deprecate the Searcher#search() methods which return a Hits instance, to reduce API clutter. Hits could have a public constructor that takes a Searcher instance instead.
        Hide
        Michael Busch added a comment -

        With Hits/HitIterator we have two really simple ways to iterate over a long list of search results.

        I think this is exactly a problem of Hits. If you use an HitIterator to iterate over let's say
        2000 results, then Hits will run the same query 5 times under the covers with 100, 200, 400,
        800, 1600 as values for the heap used in TopDocCollector.

        IMO Hits only makes sense if you want to use it for paging or, as Doug pointed out, for
        prefetching of hits in a scrollable pane. But then it's just as easy to implement this using
        TopDocCollector/TopDocs as shown in the SearchFiles demo (in this patch). The latter approach
        is also much more flexible, as it allows you to control the parameters.

        The TopDocs/HitCollector-based approach is basically one level below Hits, and thus, Hits can
        clearly be regarded a convenience class then.

        What are in your opinion the advantages of using an Iterator interface instead of looping over
        a ScoreDoc[] array?

        Show
        Michael Busch added a comment - With Hits/HitIterator we have two really simple ways to iterate over a long list of search results. I think this is exactly a problem of Hits. If you use an HitIterator to iterate over let's say 2000 results, then Hits will run the same query 5 times under the covers with 100, 200, 400, 800, 1600 as values for the heap used in TopDocCollector. IMO Hits only makes sense if you want to use it for paging or, as Doug pointed out, for prefetching of hits in a scrollable pane. But then it's just as easy to implement this using TopDocCollector/TopDocs as shown in the SearchFiles demo (in this patch). The latter approach is also much more flexible, as it allows you to control the parameters. The TopDocs/HitCollector-based approach is basically one level below Hits, and thus, Hits can clearly be regarded a convenience class then. What are in your opinion the advantages of using an Iterator interface instead of looping over a ScoreDoc[] array?
        Hide
        Christian Kohlschütter added a comment -

        Michael,

        the current implementation of Hits certainly has its deficiencies, but represents a very simple way to retrieve documents from Lucene. As long as there is no real replacement, I simply do not a reason to deprecate it.

        A replacement could be an API which allows something like:

        for(Iterator<ScoreDoc> it = searcher.iterator(query); it.hasNext(); )

        { (...) if (...) break; }
        Show
        Christian Kohlschütter added a comment - Michael, the current implementation of Hits certainly has its deficiencies, but represents a very simple way to retrieve documents from Lucene. As long as there is no real replacement, I simply do not a reason to deprecate it. A replacement could be an API which allows something like: for(Iterator<ScoreDoc> it = searcher.iterator(query); it.hasNext(); ) { (...) if (...) break; }
        Hide
        Michael Busch added a comment -

        A replacement could be an API which allows something like:

        for(Iterator<ScoreDoc> it = searcher.iterator(query); it.hasNext(); )

        Unknown macro: { (...) if (...) break; }

        That would duplicate the search methods that use a HitCollector.
        I still don't understand why an iterator approach is better/easier
        than Lucene's callback (HitCollector) approach.

        Show
        Michael Busch added a comment - A replacement could be an API which allows something like: for(Iterator<ScoreDoc> it = searcher.iterator(query); it.hasNext(); ) Unknown macro: { (...) if (...) break; } That would duplicate the search methods that use a HitCollector. I still don't understand why an iterator approach is better/easier than Lucene's callback (HitCollector) approach.
        Hide
        Uwe Schindler added a comment -

        The HitCollerctor and Iterator approach only supports forward displaying of results. In a typical Google-like Website, where the user can just jump to page X, display some results and can jump back to page Y and display results from there too, Hits works really good. The problem with Hits is, that it returns and caches whole "Documents". If it could just return ScoreDocs and would implement the Java Collection API "List" Interface (using AbstractList), that would be a good replacement for "navigateable" result sets.

        Show
        Uwe Schindler added a comment - The HitCollerctor and Iterator approach only supports forward displaying of results. In a typical Google-like Website, where the user can just jump to page X, display some results and can jump back to page Y and display results from there too, Hits works really good. The problem with Hits is, that it returns and caches whole "Documents". If it could just return ScoreDocs and would implement the Java Collection API "List" Interface (using AbstractList), that would be a good replacement for "navigateable" result sets.
        Hide
        Christian Kohlschütter added a comment -

        Michael:
        The HitCollector callback is called in index order (or in any other, non-deterministic order), whereas the results in Hits are sorted (by relevance or any given Sort order).

        Uwe:
        Good idea, this would be even better than the plain iterator class.

        Show
        Christian Kohlschütter added a comment - Michael: The HitCollector callback is called in index order (or in any other, non-deterministic order), whereas the results in Hits are sorted (by relevance or any given Sort order). Uwe: Good idea, this would be even better than the plain iterator class.
        Hide
        Uwe Schindler added a comment - - edited

        Christian:
        For my project "panFMP", I created a List-based wrapper on top of current "Hits", which also uses some special assumptions on the index structure behind (which is used by panFMP): http://panfmp.svn.sourceforge.net/viewvc/panfmp/main/trunk/src/de/pangaea/metadataportal/search/SearchResultList.java?view=markup
        On my plan is to change this to use TopDocCollector and only cache the ScoreDoc instances (id and score). My example just shows, how AbstractList may be used. The problem with AbstractList is, that get() cannot throw Exceptions, but you get the Iterator for free. For better performance, your could also overwrite iterator() and use a sequential TopDocs-based approach for fetching results.

        With the Java Collections List you could also do this syntactical sugar for displaying Google-Like search results (based on the implementation above):

        int start=0,count=10;
        List<SearchResultItem> page=list.subList(
        Math.min(start, list.size()),
        Math.min(start+count, list.size())
        );
        for (SearchResultItem item : page)

        { System.out.println(item.getIdentifier()); }
        Show
        Uwe Schindler added a comment - - edited Christian: For my project "panFMP", I created a List-based wrapper on top of current "Hits", which also uses some special assumptions on the index structure behind (which is used by panFMP): http://panfmp.svn.sourceforge.net/viewvc/panfmp/main/trunk/src/de/pangaea/metadataportal/search/SearchResultList.java?view=markup On my plan is to change this to use TopDocCollector and only cache the ScoreDoc instances (id and score). My example just shows, how AbstractList may be used. The problem with AbstractList is, that get() cannot throw Exceptions, but you get the Iterator for free. For better performance, your could also overwrite iterator() and use a sequential TopDocs-based approach for fetching results. With the Java Collections List you could also do this syntactical sugar for displaying Google-Like search results (based on the implementation above): int start=0,count=10; List<SearchResultItem> page=list.subList( Math.min(start, list.size()), Math.min(start+count, list.size()) ); for (SearchResultItem item : page) { System.out.println(item.getIdentifier()); }
        Hide
        Michael Busch added a comment -

        The HitCollerctor and Iterator approach only supports forward displaying of results.

        Even if you want to show results 100-200, then you still need to sort at least 200 results, also Hits needs to, it doesn't do any magic. Then you can as well keep the ScoreDoc[] array to page back.
        Please take a look at the patch and the new demo code, that will help you understand how Hits and TopDocCollector/TopDocs (which is called from Hits) work.

        The HitCollector callback is called in index order (or in any other, non-deterministic order), whereas the results in Hits are sorted (by relevance or any given Sort order).

        Hits uses TopDocCollector/TopDocs for sorting. The same can be achieved using those APIs, which Searcher offers, directly. Please take a look at the patch. I replaced a lot of usages of Hits in the test cases with the TopDocs APIs. In most cases this was a one line change.

        Show
        Michael Busch added a comment - The HitCollerctor and Iterator approach only supports forward displaying of results. Even if you want to show results 100-200, then you still need to sort at least 200 results, also Hits needs to, it doesn't do any magic. Then you can as well keep the ScoreDoc[] array to page back. Please take a look at the patch and the new demo code, that will help you understand how Hits and TopDocCollector/TopDocs (which is called from Hits) work. The HitCollector callback is called in index order (or in any other, non-deterministic order), whereas the results in Hits are sorted (by relevance or any given Sort order). Hits uses TopDocCollector/TopDocs for sorting. The same can be achieved using those APIs, which Searcher offers, directly. Please take a look at the patch. I replaced a lot of usages of Hits in the test cases with the TopDocs APIs. In most cases this was a one line change.
        Hide
        Michael Busch added a comment -

        Of course I don't want to remove something that adds real value to Lucene. IMO, Hits does not. It is a class whose APIs we always discourage to use. Such a class does not belong in the core. The HitCollector (especially TopDocCollector) APIs are similar to use and offer more control, flexibility and insight.

        So I think we should deprecate Hits (it will still be part then of 2.4 and 2.9 and removed in 3.0). In the meantime, if people really think that there should be a replacement for Hits in the future, we can develop a new helper class in contib that doesn't have the drawbacks that Hits has, and contributors like Uwe and Christian are very welcome to submit patches for that!

        However, I don't want to make a decision here that is in conflict with opinions of others, so maybe it might be a good idea to call a vote on java-dev about this issue?

        Show
        Michael Busch added a comment - Of course I don't want to remove something that adds real value to Lucene. IMO, Hits does not. It is a class whose APIs we always discourage to use. Such a class does not belong in the core. The HitCollector (especially TopDocCollector) APIs are similar to use and offer more control, flexibility and insight. So I think we should deprecate Hits (it will still be part then of 2.4 and 2.9 and removed in 3.0). In the meantime, if people really think that there should be a replacement for Hits in the future, we can develop a new helper class in contib that doesn't have the drawbacks that Hits has, and contributors like Uwe and Christian are very welcome to submit patches for that! However, I don't want to make a decision here that is in conflict with opinions of others, so maybe it might be a good idea to call a vote on java-dev about this issue?
        Hide
        Mark Harwood added a comment -

        Hits does seem to be very good at leading new Lucene users into writing non-performant code and on that basis is probably worth deprecating.

        However, in it's place I think we can do more to support the commonest search scenario users will have, which I suspect is the typical web-based search application, serving a page of "results xx to yy of z,zzz,zzz"..

        The HitPageCollector in the link below is more helpful in this respect than the raw TopDocCollector - it takes a "start" value in the constructor which represents the "xx" in the above. It could be extended to have read-only properties representing y and z too and helper methods e.g. isFirstPage, isLastPage.
        I'm sure many of us have coded this same logic in many different places over the years and I think it could usefully be in core as a better embodiment of best practice than Hits.

        See http://www.mail-archive.com/java-user@lucene.apache.org/msg15344.html

        Show
        Mark Harwood added a comment - Hits does seem to be very good at leading new Lucene users into writing non-performant code and on that basis is probably worth deprecating. However, in it's place I think we can do more to support the commonest search scenario users will have, which I suspect is the typical web-based search application, serving a page of "results xx to yy of z,zzz,zzz".. The HitPageCollector in the link below is more helpful in this respect than the raw TopDocCollector - it takes a "start" value in the constructor which represents the "xx" in the above. It could be extended to have read-only properties representing y and z too and helper methods e.g. isFirstPage, isLastPage. I'm sure many of us have coded this same logic in many different places over the years and I think it could usefully be in core as a better embodiment of best practice than Hits. See http://www.mail-archive.com/java-user@lucene.apache.org/msg15344.html
        Hide
        Michael Busch added a comment -

        Hits does seem to be very good at leading new Lucene users into writing non-performant code and on that basis is probably worth deprecating.

        So it seems that a fair amount of people agree here that Hits should be deprecated.

        Hits will still be part of Lucene 2.4 and 2.9. In the meantime, we can develop new helper classes,
        like Mark's HitPageCollector, if people see the need for it. Contrib would be a good place for
        those classes.

        So in case of no negative votes, I will commit this patch before the weekend.

        Show
        Michael Busch added a comment - Hits does seem to be very good at leading new Lucene users into writing non-performant code and on that basis is probably worth deprecating. So it seems that a fair amount of people agree here that Hits should be deprecated. Hits will still be part of Lucene 2.4 and 2.9. In the meantime, we can develop new helper classes, like Mark's HitPageCollector, if people see the need for it. Contrib would be a good place for those classes. So in case of no negative votes, I will commit this patch before the weekend.
        Hide
        Michael Busch added a comment -

        Beefed up doPagingSearch() in the SearchFiles demo.
        No It collects enough results for the first 5 result pages.
        If the user wants to skip beyond the 5th page, then
        the query is reexecuted to collect all hits, similar to
        Google. The user can now also page back or jump
        directly to a page number.

        I'm going to commit this patch now.

        Show
        Michael Busch added a comment - Beefed up doPagingSearch() in the SearchFiles demo. No It collects enough results for the first 5 result pages. If the user wants to skip beyond the 5th page, then the query is reexecuted to collect all hits, similar to Google. The user can now also page back or jump directly to a page number. I'm going to commit this patch now.
        Hide
        Michael Busch added a comment -

        Committed.

        Show
        Michael Busch added a comment - Committed.
        Hide
        Hoss Man added a comment -

        i've been away from the party while all the exciting stuff was being discussed, but for my 2 cents...

        • I firmly believe deprecating Hits is the way to go
        • I recognize the value in having a simple API that implements the Iterator API like HitIterator, but that doesn't mean it needs to be powered by "Hits", An alternate approach would be to add an "iterator()" method to TopDocs and TopFieldDocs that returned an iterator over the ScoreDocs ... or even an "iterator(Searcher)" method that would return a Iterator of "Hit" objects – this could have the exact same API as the HitIterator and Hit classes currently do – just more efficiently and with different constructors.

        As Michael said: Hits is deprecated, not deleted. clients will have plenty of time to change, and the javadocs for Hits provides a good explanation for how to use TopDocs instead – but if people want to implement an "Iterator" based alternative, we should probably do that in a new issue.

        Show
        Hoss Man added a comment - i've been away from the party while all the exciting stuff was being discussed, but for my 2 cents... I firmly believe deprecating Hits is the way to go I recognize the value in having a simple API that implements the Iterator API like HitIterator, but that doesn't mean it needs to be powered by "Hits", An alternate approach would be to add an "iterator()" method to TopDocs and TopFieldDocs that returned an iterator over the ScoreDocs ... or even an "iterator(Searcher)" method that would return a Iterator of "Hit" objects – this could have the exact same API as the HitIterator and Hit classes currently do – just more efficiently and with different constructors. As Michael said: Hits is deprecated, not deleted. clients will have plenty of time to change, and the javadocs for Hits provides a good explanation for how to use TopDocs instead – but if people want to implement an "Iterator" based alternative, we should probably do that in a new issue.
        Hide
        Trejkaz added a comment -

        To answer Doug's initial question, yes, we are using this in a desktop application inside a Swing TableModel.

        Show
        Trejkaz added a comment - To answer Doug's initial question, yes, we are using this in a desktop application inside a Swing TableModel.

          People

          • Assignee:
            Michael Busch
            Reporter:
            Michael Busch
          • Votes:
            2 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development