Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4, 3.0
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898

      Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch.

      1. LUCENE-2215.patch
        18 kB
        Robert Muir
      2. LUCENE-2215.patch
        17 kB
        Robert Muir
      3. LUCENE-2215.patch
        23 kB
        Grant Ingersoll
      4. IterablePaging.java
        6 kB
        Aaron McCurry
      5. PagingCollector.java
        3 kB
        Aaron McCurry
      6. TestingPagingCollector.java
        4 kB
        Aaron McCurry

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Bulk close after release of 3.5

          Show
          Uwe Schindler added a comment - Bulk close after release of 3.5
          Hide
          Michael McCandless added a comment -

          We add this param to a protected method signature, so it would affect subclasses of IndexSearcher.

          Ahh, right. Well, I think we can make an exception here – subclassing IS is very expert.

          Show
          Michael McCandless added a comment - We add this param to a protected method signature, so it would affect subclasses of IndexSearcher. Ahh, right. Well, I think we can make an exception here – subclassing IS is very expert.
          Hide
          Robert Muir added a comment -

          Or is there some other back compat issue?

          We add this param to a protected method signature, so it would affect subclasses of IndexSearcher.

          Show
          Robert Muir added a comment - Or is there some other back compat issue? We add this param to a protected method signature, so it would affect subclasses of IndexSearcher.
          Hide
          Michael McCandless added a comment -

          For 3.x can we just add these methods to IndexSearcher (not Searcher/Searchable)? This would require the app to use IndexSearcher if they are not already, which is great because that's what they'll need to do in 4.0 anyway (since Searcher/Searchable are deprecated).

          Or is there some other back compat issue?

          Show
          Michael McCandless added a comment - For 3.x can we just add these methods to IndexSearcher (not Searcher/Searchable)? This would require the app to use IndexSearcher if they are not already, which is great because that's what they'll need to do in 4.0 anyway (since Searcher/Searchable are deprecated). Or is there some other back compat issue?
          Hide
          Robert Muir added a comment -

          I committed the patch to trunk: we can discuss if/how we should backport to 3.x, backwards break or add sophisticated layer or whatever.

          Show
          Robert Muir added a comment - I committed the patch to trunk: we can discuss if/how we should backport to 3.x, backwards break or add sophisticated layer or whatever.
          Hide
          Grant Ingersoll added a comment -

          +1.

          Show
          Grant Ingersoll added a comment - +1.
          Hide
          Robert Muir added a comment -

          renamed to searchAfter, added a little to the javadocs, and improved the test coverage a bit.

          Show
          Robert Muir added a comment - renamed to searchAfter, added a little to the javadocs, and improved the test coverage a bit.
          Hide
          Michael McCandless added a comment -

          +1 for TopDocs searchAfter(ScoreDoc previous, Query query, int n), though maybe name the param "after" not "previous"?

          Show
          Michael McCandless added a comment - +1 for TopDocs searchAfter(ScoreDoc previous, Query query, int n), though maybe name the param "after" not "previous"?
          Hide
          Robert Muir added a comment -

          Maybe instead 'searchAfter' ? perhaps we could rearrange the arguments too, to make it read better:
          instead of:

          TopDocs pagedSearch(Query query, ScoreDoc lastBottom, int n)

          we could do:

          TopDocs searchAfter(ScoreDoc previous, Query query, int n)

          or something like that?

          Show
          Robert Muir added a comment - Maybe instead 'searchAfter' ? perhaps we could rearrange the arguments too, to make it read better: instead of: TopDocs pagedSearch(Query query, ScoreDoc lastBottom, int n) we could do: TopDocs searchAfter(ScoreDoc previous, Query query, int n) or something like that?
          Hide
          Hoss Man added a comment -

          Hoss, how can they use this in combination with an ExecutorService then?

          I sync'ed up with rmuir on irc, i wasn't aware that the ExecutorService only worked with some IndexSeracher methods and not others. I see now why a new method is important (for deep paging you're really going ot want to use the ExecutorService), but i'm still a little concerned about hte name sucking in novice users who don't really need it to do paginated search.

          we can add detailed javadocs making it clear to folks what the diff is, but perhaps a diff name would be more appropriate? at a minimum maybe call it "deepPagingSearch" or something else that conveys more explicitly how it works ("searchAfterLastMatch" ?)

          Show
          Hoss Man added a comment - Hoss, how can they use this in combination with an ExecutorService then? I sync'ed up with rmuir on irc, i wasn't aware that the ExecutorService only worked with some IndexSeracher methods and not others. I see now why a new method is important (for deep paging you're really going ot want to use the ExecutorService), but i'm still a little concerned about hte name sucking in novice users who don't really need it to do paginated search. we can add detailed javadocs making it clear to folks what the diff is, but perhaps a diff name would be more appropriate? at a minimum maybe call it "deepPagingSearch" or something else that conveys more explicitly how it works ("searchAfterLastMatch" ?)
          Hide
          Robert Muir added a comment -

          Hoss, how can they use this in combination with an ExecutorService then?

          With the current patch: this works?

          Show
          Robert Muir added a comment - Hoss, how can they use this in combination with an ExecutorService then? With the current patch: this works?
          Hide
          Hoss Man added a comment -

          I'm not overly familiar with the historical evolution of this patch, but is a new IndexSearcher.pagedSearch really warranted in this case?

          Adding the "ScoreDoc lastBottom" param to TopScoreDocCollector.create seems like it would make this easy enough to use.

          My concern is that most novice users doing "simple paginated search" aren't ever going to hit the deep paging problem – keeping track of the current page number and multiplying by the number of items per page to determine the numHit value works just fine for most applications. But if they see an explicit IndexSearcher.pagedSearch method, they are going to think "this is what i need for paginated search" and then be confused as to how they should keep track in their app of the last ScoreDoc instance.

          I mean, i'm all in favor of simple convince methods – but isn't TopScoreDocCollector.create simple enough for the level of knowledge a user will need to use this effectively?

          Show
          Hoss Man added a comment - I'm not overly familiar with the historical evolution of this patch, but is a new IndexSearcher.pagedSearch really warranted in this case? Adding the "ScoreDoc lastBottom" param to TopScoreDocCollector.create seems like it would make this easy enough to use. My concern is that most novice users doing "simple paginated search" aren't ever going to hit the deep paging problem – keeping track of the current page number and multiplying by the number of items per page to determine the numHit value works just fine for most applications. But if they see an explicit IndexSearcher.pagedSearch method, they are going to think "this is what i need for paginated search" and then be confused as to how they should keep track in their app of the last ScoreDoc instance. I mean, i'm all in favor of simple convince methods – but isn't TopScoreDocCollector.create simple enough for the level of knowledge a user will need to use this effectively?
          Hide
          Michael McCandless added a comment -

          Patch looks great! I like the API – all the app must do is hold onto the bottom doc from last page and resubmit for next page.

          Show
          Michael McCandless added a comment - Patch looks great! I like the API – all the app must do is hold onto the bottom doc from last page and resubmit for next page.
          Hide
          Robert Muir added a comment -

          I think for this issue, we should just add IndexSearcher.pagedSearch() methods for paging, passing in the bottom result of the previous page (ScoreDoc lastBottom).

          Note for the first page of results, this means you pass in null for lastBottom and its doing the same thing as search(), so even if you choose to use this method your 'page 1 results' will be just as fast as IndexSearcher.search()

          Also, its easy to make a ScoreDoc from the int+float so apps that want to keep this state can probably do so easily (and maybe they benchmark and decide its only worthwhile to cutover to this once you hit page 3, or 4, or whatever).

          I also added a few optimizations to the previous code: specialized InOrder/OutOfOrder, subtract docBase from the bottom's doc in setNextReader to remove an add, etc.

          I didn't add the methods that take sort, but I think we could do this in a separate issue.

          Show
          Robert Muir added a comment - I think for this issue, we should just add IndexSearcher.pagedSearch() methods for paging, passing in the bottom result of the previous page (ScoreDoc lastBottom). Note for the first page of results, this means you pass in null for lastBottom and its doing the same thing as search(), so even if you choose to use this method your 'page 1 results' will be just as fast as IndexSearcher.search() Also, its easy to make a ScoreDoc from the int+float so apps that want to keep this state can probably do so easily (and maybe they benchmark and decide its only worthwhile to cutover to this once you hit page 3, or 4, or whatever). I also added a few optimizations to the previous code: specialized InOrder/OutOfOrder, subtract docBase from the bottom's doc in setNextReader to remove an add, etc. I didn't add the methods that take sort, but I think we could do this in a separate issue.
          Hide
          Charlie Hubbard added a comment -

          So this is not apart of 3.x yet. If I want to use this with 3.x where is the most recent version? It looks like Grant has made some changes since the original upload I'd like to get those changes since it sounds like it fixed several bugs in the original.

          Show
          Charlie Hubbard added a comment - So this is not apart of 3.x yet. If I want to use this with 3.x where is the most recent version? It looks like Grant has made some changes since the original upload I'd like to get those changes since it sounds like it fixed several bugs in the original.
          Hide
          Grant Ingersoll added a comment -

          I do think it is still valid and much needed, but haven't had time to work on it and I don't think we ever got agreement on the API level.

          Show
          Grant Ingersoll added a comment - I do think it is still valid and much needed, but haven't had time to work on it and I don't think we ever got agreement on the API level.
          Hide
          Simon Willnauer added a comment -

          it this still valid / needed? Grant are you planning to work on this any time soon?

          Show
          Simon Willnauer added a comment - it this still valid / needed? Grant are you planning to work on this any time soon?
          Hide
          Itamar Syn-Hershko added a comment -

          Thanks. I ended up using the standard Lucene paging code.

          Hopefully this will get into Lucene soon...

          Show
          Itamar Syn-Hershko added a comment - Thanks. I ended up using the standard Lucene paging code. Hopefully this will get into Lucene soon...
          Hide
          javi added a comment -

          Itamar, in case it's helpful for you: my code is not in production yet, but close, and I am not using latest patch, but the original one. I have not seen any issue in my regression tests against my older code where I was not using this.

          Show
          javi added a comment - Itamar, in case it's helpful for you: my code is not in production yet, but close, and I am not using latest patch, but the original one. I have not seen any issue in my regression tests against my older code where I was not using this.
          Hide
          Itamar Syn-Hershko added a comment -

          Hi guys, any update on this?

          I'm interested in using this for production code. Can anyone comment on how safe / mature this code is?

          Thanks!

          Show
          Itamar Syn-Hershko added a comment - Hi guys, any update on this? I'm interested in using this for production code. Can anyone comment on how safe / mature this code is? Thanks!
          Hide
          Shai Erera added a comment -

          Sure let's wait for the patch and some perf. results.

          Show
          Shai Erera added a comment - Sure let's wait for the patch and some perf. results.
          Hide
          Grant Ingersoll added a comment -

          Let's be careful about the semantics here Grant. Most if not all applications implement paging indeed, but I believe only FEW actually store user contexts between searches. PagingCollector relies on the application to store the lowest ranking doc that was returned previously, which means storing context between user's searches.

          I think, assuming the math plays out, that once you show the gains to be had here, esp. for deep paging, storing an int and a float is trivial. If they are implementing paging, they are already keeping state about what page they are on.

          Now they will need to think "where do I get this low score from?"

          Sorry, but If that is that hard to figure out, then I don't see how they have any business writing a Lucene application to begin with. A simple javadoc that says these two values are taken from the last result of the previously seen page should do the trick

          At any rate, let's put up the patches and find out instead of debating. I should have time today to do mine.

          Show
          Grant Ingersoll added a comment - Let's be careful about the semantics here Grant. Most if not all applications implement paging indeed, but I believe only FEW actually store user contexts between searches. PagingCollector relies on the application to store the lowest ranking doc that was returned previously, which means storing context between user's searches. I think, assuming the math plays out, that once you show the gains to be had here, esp. for deep paging, storing an int and a float is trivial. If they are implementing paging, they are already keeping state about what page they are on. Now they will need to think "where do I get this low score from?" Sorry, but If that is that hard to figure out, then I don't see how they have any business writing a Lucene application to begin with. A simple javadoc that says these two values are taken from the last result of the previously seen page should do the trick At any rate, let's put up the patches and find out instead of debating. I should have time today to do mine.
          Hide
          Shai Erera added a comment -

          since I think it's safe to say most applications implement paging

          Let's be careful about the semantics here Grant. Most if not all applications implement paging indeed, but I believe only FEW actually store user contexts between searches. PagingCollector relies on the application to store the lowest ranking doc that was returned previously, which means storing context between user's searches.

          I agree w/ Mike's statement about 99.9% of the searches would never run that code, which is why I've proposed a delegation/wrapper approach from the beginning. I also think that we should make some allowances here and there, for the non-common case, and introduce better software design than specialized code. A Collector filter approach for some rare (or even less common) cases seems very reasonable to me.

          Also, I think that if we add to TSDC a create method which takes into account the previously scored lowest doc, it will confuse people. Now they will need to think "where do I get this low score from?" - but perhaps after I see the code, it wouldn't be such a bad thing .... just have a feeling TSDC and TFC should be left on their own, and extreme paging stuff should either be its own specialized collector, or a wrapper.

          Show
          Shai Erera added a comment - since I think it's safe to say most applications implement paging Let's be careful about the semantics here Grant. Most if not all applications implement paging indeed, but I believe only FEW actually store user contexts between searches. PagingCollector relies on the application to store the lowest ranking doc that was returned previously, which means storing context between user's searches. I agree w/ Mike's statement about 99.9% of the searches would never run that code, which is why I've proposed a delegation/wrapper approach from the beginning. I also think that we should make some allowances here and there, for the non-common case, and introduce better software design than specialized code. A Collector filter approach for some rare (or even less common) cases seems very reasonable to me. Also, I think that if we add to TSDC a create method which takes into account the previously scored lowest doc, it will confuse people. Now they will need to think "where do I get this low score from?" - but perhaps after I see the code, it wouldn't be such a bad thing .... just have a feeling TSDC and TFC should be left on their own, and extreme paging stuff should either be its own specialized collector, or a wrapper.
          Hide
          Grant Ingersoll added a comment -

          Yeah, but one could make the argument, Mike, that the existing "optimizations" are useless for the most common case, since I think it's safe to say most applications implement paging. Of course, that being said, most users don't page all that deeply. Also, for something like Solr that prefetches the top 50 it might not be good, either. Still, in my mind it is one additional boolean check, as in:

          if ( (current stuff) || (pagingInfoPresent == true && paging check) )
          ...
          

          pagingInfoPresent can be determined at construction time and that whole clause would be short circuited very quickly.

          That being said, delegation could be done at construction time, too and more cleanly separates things. I'll try to put up my version tomorrow.

          Show
          Grant Ingersoll added a comment - Yeah, but one could make the argument, Mike, that the existing "optimizations" are useless for the most common case, since I think it's safe to say most applications implement paging. Of course, that being said, most users don't page all that deeply. Also, for something like Solr that prefetches the top 50 it might not be good, either. Still, in my mind it is one additional boolean check, as in: if ( (current stuff) || (pagingInfoPresent == true && paging check) ) ... pagingInfoPresent can be determined at construction time and that whole clause would be short circuited very quickly. That being said, delegation could be done at construction time, too and more cleanly separates things. I'll try to put up my version tomorrow.
          Hide
          Michael McCandless added a comment -

          ...through a fairly simple update of some of the clauses to appropriate short circuit things that we can just hook this into the existing collectors w/o no need for any delegation or changes? Let me try a patch. Now that the benchmark stuff is in, we should be able to test.

          This'd make me nervous...

          Ie I don't think we should insert bytecodes for the 99.9% of searches that wouldn't make use of this, even if we can't uncover a slowdown with benchmarking.

          We should still benchmark it though (I'm curious)... we should also benchmark the delegate solution.

          Show
          Michael McCandless added a comment - ...through a fairly simple update of some of the clauses to appropriate short circuit things that we can just hook this into the existing collectors w/o no need for any delegation or changes? Let me try a patch. Now that the benchmark stuff is in, we should be able to test. This'd make me nervous... Ie I don't think we should insert bytecodes for the 99.9% of searches that wouldn't make use of this, even if we can't uncover a slowdown with benchmarking. We should still benchmark it though (I'm curious)... we should also benchmark the delegate solution.
          Hide
          Uwe Schindler added a comment -

          Hey, and I want to fix the NaN thing in TSDC: LUCENE-2271

          Maybe when we delegate, we can also use my cool code that switches the delegate to remove on comparison after the queue is full.

          Show
          Uwe Schindler added a comment - Hey, and I want to fix the NaN thing in TSDC: LUCENE-2271 Maybe when we delegate, we can also use my cool code that switches the delegate to remove on comparison after the queue is full.
          Hide
          Grant Ingersoll added a comment -

          Mike, don't you think, though, that through a fairly simple update of some of the clauses to appropriate short circuit things that we can just hook this into the existing collectors w/o no need for any delegation or changes? Let me try a patch. Now that the benchmark stuff is in, we should be able to test.

          Show
          Grant Ingersoll added a comment - Mike, don't you think, though, that through a fairly simple update of some of the clauses to appropriate short circuit things that we can just hook this into the existing collectors w/o no need for any delegation or changes? Let me try a patch. Now that the benchmark stuff is in, we should be able to test.
          Hide
          Michael McCandless added a comment -

          This is a neat collector!

          I like the idea of chaining/filtering... couldn't we put this in core
          (under TFC/TSDC.create), but instead of doubling the 12 specialized
          (anonymous) impls we now have, just delegate?

          Ie, we'd make a FilteredCollector, taking another collector when it's
          created, and then on every collect call, only if the hit is "weak"
          enough (ie is worse than what the app provided as prev low score/doc)
          would it forward it to the delegate? I guess we should test perf w/
          (the new additions to benchmark – yay!) to see if specializing the
          code (even anonymously) is warranted.

          The indent whitespace needs to fixed to 2 spaces...

          Show
          Michael McCandless added a comment - This is a neat collector! I like the idea of chaining/filtering... couldn't we put this in core (under TFC/TSDC.create), but instead of doubling the 12 specialized (anonymous) impls we now have, just delegate? Ie, we'd make a FilteredCollector, taking another collector when it's created, and then on every collect call, only if the hit is "weak" enough (ie is worse than what the app provided as prev low score/doc) would it forward it to the delegate? I guess we should test perf w/ (the new additions to benchmark – yay!) to see if specializing the code (even anonymously) is warranted. The indent whitespace needs to fixed to 2 spaces...
          Hide
          Grant Ingersoll added a comment -

          The only complication I see is that if we want to make it extremely efficient, we'll need to double the number of Collector impls for TSDC and TFC (the internal instances that are created) ...

          I'm not convinced yet. I think we can likely make it short circuit quite fast for the non-paging case, but rather than guess, let's benchmark. I'm extracting my Benchmark collector stuff right now on LUCENE-2343. I also am not sure we need to double the number of collectors.

          Show
          Grant Ingersoll added a comment - The only complication I see is that if we want to make it extremely efficient, we'll need to double the number of Collector impls for TSDC and TFC (the internal instances that are created) ... I'm not convinced yet. I think we can likely make it short circuit quite fast for the non-paging case, but rather than guess, let's benchmark. I'm extracting my Benchmark collector stuff right now on LUCENE-2343 . I also am not sure we need to double the number of collectors.
          Hide
          Shai Erera added a comment -

          Ohh I see. Missed that. So that can be folded into TSDC.create, or actually another create method which in addition to numHits specifies a 'startFrom' w/ a ScoreDoc, and a similar one for TFC.create. The only complication I see is that if we want to make it extremely efficient, we'll need to double the number of Collector impls for TSDC and TFC (the internal instances that are created) ...

          Then ... I think ... TDC can have just one topDocs() indeed.

          But the idea of doubling the number of collector impls worries me. Especially for cases where even for paging, you won't use the Paging thing because you simply don't maintain any state on the server. I wonder if for those apps, who do maintain the state, and I imagine there aren't lots of them, having a PagingCollector which either wraps TSDC, or is the one you provided in the patch (but does not extend TDC) would be better? It won't unnecessarily over-complicate the code of TSDC and TFC (which will have 12 impls) and give you sort of what you need - if you maintain state, then you know that's the second page the user requested, so using a PagingCollector specifically doesn't sound that bad to me?

          Show
          Shai Erera added a comment - Ohh I see. Missed that. So that can be folded into TSDC.create, or actually another create method which in addition to numHits specifies a 'startFrom' w/ a ScoreDoc, and a similar one for TFC.create. The only complication I see is that if we want to make it extremely efficient, we'll need to double the number of Collector impls for TSDC and TFC (the internal instances that are created) ... Then ... I think ... TDC can have just one topDocs() indeed. But the idea of doubling the number of collector impls worries me. Especially for cases where even for paging, you won't use the Paging thing because you simply don't maintain any state on the server. I wonder if for those apps, who do maintain the state, and I imagine there aren't lots of them, having a PagingCollector which either wraps TSDC, or is the one you provided in the patch (but does not extend TDC) would be better? It won't unnecessarily over-complicate the code of TSDC and TFC (which will have 12 impls) and give you sort of what you need - if you maintain state, then you know that's the second page the user requested, so using a PagingCollector specifically doesn't sound that bad to me?
          Hide
          Grant Ingersoll added a comment -

          I'm saying PagingColl. doesn't even exist and it is just folded into the two existing In/Out Collectors with a new create() method that knows when it's paging and when it's not.

          Show
          Grant Ingersoll added a comment - I'm saying PagingColl. doesn't even exist and it is just folded into the two existing In/Out Collectors with a new create() method that knows when it's paging and when it's not.
          Hide
          Shai Erera added a comment -

          So what's the motivation of declaring PagingCollector a TopDocsCollector? Would you envision one to request for a TopDocsCollector but don't care if it's TSDC, TFC or PagingCollector? I would rather have it extend TDC directly, and then you won't need to throw UOE for the rest of the methods ...

          What about renaming it to TopScorePagingCollector?

          Show
          Shai Erera added a comment - So what's the motivation of declaring PagingCollector a TopDocsCollector? Would you envision one to request for a TopDocsCollector but don't care if it's TSDC, TFC or PagingCollector? I would rather have it extend TDC directly, and then you won't need to throw UOE for the rest of the methods ... What about renaming it to TopScorePagingCollector?
          Hide
          Grant Ingersoll added a comment -

          I must admit I don't like throwing UOE. I imagine the naive user calling one of these and hit w/ UOE out of nowhere really . Perhaps it's a sign PagingCollector should not be a sub-class of TopDocsCollector? It does not benefit from it in any way because it overrides all the main methods, impls them or throws UOE for those it doesn't like. So perhaps it should just be a TopScorePagingCollector which copies some of the functionality of TSDC, but is not a TDC itself. It will have a topDocs() method, and only it (b/c I agree the rest don't make any sense).

          I agree, not a huge fan of it either, but it is bad form to call it when using this collector and I'd rather people learn that up front. Like I said in the last comment, I think we'd be better off trying to integrate this in to a lower level and not even having a "special" collector. If we just added a create option that took in the necessary info, then we could just mod the existing collectors, possibly. Then those two topDocs methods could just be deprecated/removed.

          Show
          Grant Ingersoll added a comment - I must admit I don't like throwing UOE. I imagine the naive user calling one of these and hit w/ UOE out of nowhere really . Perhaps it's a sign PagingCollector should not be a sub-class of TopDocsCollector? It does not benefit from it in any way because it overrides all the main methods, impls them or throws UOE for those it doesn't like. So perhaps it should just be a TopScorePagingCollector which copies some of the functionality of TSDC, but is not a TDC itself. It will have a topDocs() method, and only it (b/c I agree the rest don't make any sense). I agree, not a huge fan of it either, but it is bad form to call it when using this collector and I'd rather people learn that up front. Like I said in the last comment, I think we'd be better off trying to integrate this in to a lower level and not even having a "special" collector. If we just added a create option that took in the necessary info, then we could just mod the existing collectors, possibly. Then those two topDocs methods could just be deprecated/removed.
          Hide
          Shai Erera added a comment -

          I must admit I don't like throwing UOE. I imagine the naive user calling one of these and hit w/ UOE out of nowhere really . Perhaps it's a sign PagingCollector should not be a sub-class of TopDocsCollector? It does not benefit from it in any way because it overrides all the main methods, impls them or throws UOE for those it doesn't like. So perhaps it should just be a TopScorePagingCollector which copies some of the functionality of TSDC, but is not a TDC itself. It will have a topDocs() method, and only it (b/c I agree the rest don't make any sense).

          Notice the different name I propose - to make it clear it's a collector that can be used for paging through a scored list of results.

          I BTW liked that the if/else clauses were separated, b/c you could include meaningful documentation for each. Right now those are just very long lines.

          About in-order, I think the only thing you will save is the last 'else'. Read my comment above about wrapping TSDC ... not sure about it, but it will make it more elegant.

          I'll review the rest of the patch. Didn't yet understand what's PagingIterable for ...

          Show
          Shai Erera added a comment - I must admit I don't like throwing UOE. I imagine the naive user calling one of these and hit w/ UOE out of nowhere really . Perhaps it's a sign PagingCollector should not be a sub-class of TopDocsCollector? It does not benefit from it in any way because it overrides all the main methods, impls them or throws UOE for those it doesn't like. So perhaps it should just be a TopScorePagingCollector which copies some of the functionality of TSDC, but is not a TDC itself. It will have a topDocs() method, and only it (b/c I agree the rest don't make any sense). Notice the different name I propose - to make it clear it's a collector that can be used for paging through a scored list of results. I BTW liked that the if/else clauses were separated, b/c you could include meaningful documentation for each. Right now those are just very long lines. About in-order, I think the only thing you will save is the last 'else'. Read my comment above about wrapping TSDC ... not sure about it, but it will make it more elegant. I'll review the rest of the patch. Didn't yet understand what's PagingIterable for ...
          Hide
          Grant Ingersoll added a comment -

          BTW, I've noticed that you don't track maxScore

          Good point. I think we probably should track it, so that the PagingColl could be used right from the get go.

          We might also consider deprecating the topDocs() methods that take in parameters and think about how the paging collector might be integrated at a lower level in the other collectors, such that one doesn't even have to think about calling a diff. collector.

          Show
          Grant Ingersoll added a comment - BTW, I've noticed that you don't track maxScore Good point. I think we probably should track it, so that the PagingColl could be used right from the get go. We might also consider deprecating the topDocs() methods that take in parameters and think about how the paging collector might be integrated at a lower level in the other collectors, such that one doesn't even have to think about calling a diff. collector.
          Hide
          Grant Ingersoll added a comment -

          Here's an update of Aaron's work with the following changes:

          1. Added real unit tests
          2. Made topDocs() non final in order to override in PagingCollector to handle the case where the some edge cases with larger PQ size than total hits. Overrode the other topDocs(...) methods to throw UnsupportedOperation as they aren't needed for a Paging Collector
          3. Pass in num already seen so that PQ operations can be calculated correctly. Not sure if we really need, but otherwise it puts the burden on the user to make sure the PQ is sized properly, I think, which may not be such a bad burden
          4. Renamed IterablePaging to be PagingIterable. Not a huge fan of that name either, but couldn't think of anything better
          5. Collapsed the if/else clauses in the collect method into a single if clause.

          Left to do:
          1. benchmark. Is it really better?
          2. Not entirely certain on the PagingIterable API stuff yet. Looks useful.
          3. Should we have an InOrder Collector as well? Seems like we might be able to save a few operations per doc.

          Show
          Grant Ingersoll added a comment - Here's an update of Aaron's work with the following changes: 1. Added real unit tests 2. Made topDocs() non final in order to override in PagingCollector to handle the case where the some edge cases with larger PQ size than total hits. Overrode the other topDocs(...) methods to throw UnsupportedOperation as they aren't needed for a Paging Collector 3. Pass in num already seen so that PQ operations can be calculated correctly. Not sure if we really need, but otherwise it puts the burden on the user to make sure the PQ is sized properly, I think, which may not be such a bad burden 4. Renamed IterablePaging to be PagingIterable. Not a huge fan of that name either, but couldn't think of anything better 5. Collapsed the if/else clauses in the collect method into a single if clause. Left to do: 1. benchmark. Is it really better? 2. Not entirely certain on the PagingIterable API stuff yet. Looks useful. 3. Should we have an InOrder Collector as well? Seems like we might be able to save a few operations per doc.
          Hide
          Shai Erera added a comment -

          I've reviewed PagingCollector.java and the first thing I have to say about it is that I really like it ! Saves lots of unnecessary heapify code, if the application can allow itself to store the lowest last SD.

          I have few comments/questions.

          I don't understand what getLastScoreDoc is for? Is it just a utility method? Is it something the app can compute by itself? Anyway, it lacks javadocs, so perhaps if they existed I wouldn't need to ask .

          In collect(), there's the following code:

          		} else if (score == previousPassLowest.score && doc <= previousPassLowest.doc) {
          			// if the scores are the same and the doc is less than or equal to
          			// the
          			// previous pass lowest hit doc then skip because this collector
          			// favors
          			// lower number documents.
          			return;
          

          I think there's a typo in the comment "favors lower number documents" .. while it seems to prefer higher doc IDs? The way I understand it, irregardless of whether docs are collected in/out of order, HitQueue ensures that when scores are equals, the lowest IDs are favored. Thus the first round always keeps the lowest IDs among the docs whose scores match. The next round will favor the docs whose IDs come next, and so forth ... am I right? (just clarifying my understanding).
          If that's the case, I think it'll be good if it's spelled out in the comment, and also mention that it means that document has already been returned previously (like it's documented in the previous 'if').

          The last 'else' really looks like TSDC's out-of-order version, which makes me think whether PagingCollector can be viewed as a filter on top of TSDC (and possibly even TopFieldCollector)? So if a hit should be collected, it just calls super.collect? I realize though that a Collector is a hotspot and we want to minimize 'if' let alone method call statements as much as possible. But it just feels so strong that it should be a filter ... . And you wouldn't need to specifically handle in/out orderness ... and w/ the right design, it can also wrap a TFC or any other TDC implementation ...

          BTW, I've noticed that you don't track maxScore - is it assumed that the application stores it from the first round? If so I'd document it, because the application needs to know it should use TSDC the first round, and PagingCollector the second round.

          Also, PagingCollector offers a ctor which does not force the application to pass in a ScoreDoc. See my comment from above - it might be misleading, because if you use this collector right from the very first search, you lose the maxScore tracking. I also don't see why it should be allowed - if a dummy previousPassLowest ScoreDoc is used, collect() does a lot of unnecessary 'if's. I think this collector should be used only from the second round, and a single ctor which forces a ScoreDoc to be passed would make more sense. If the application wishes to shoot itself in the leg (performance-wise), it can pass a dummy SD itself.

          Show
          Shai Erera added a comment - I've reviewed PagingCollector.java and the first thing I have to say about it is that I really like it ! Saves lots of unnecessary heapify code, if the application can allow itself to store the lowest last SD. I have few comments/questions. I don't understand what getLastScoreDoc is for? Is it just a utility method? Is it something the app can compute by itself? Anyway, it lacks javadocs, so perhaps if they existed I wouldn't need to ask . In collect(), there's the following code: } else if (score == previousPassLowest.score && doc <= previousPassLowest.doc) { // if the scores are the same and the doc is less than or equal to // the // previous pass lowest hit doc then skip because this collector // favors // lower number documents. return ; I think there's a typo in the comment "favors lower number documents" .. while it seems to prefer higher doc IDs? The way I understand it, irregardless of whether docs are collected in/out of order, HitQueue ensures that when scores are equals, the lowest IDs are favored. Thus the first round always keeps the lowest IDs among the docs whose scores match. The next round will favor the docs whose IDs come next, and so forth ... am I right? (just clarifying my understanding). If that's the case, I think it'll be good if it's spelled out in the comment, and also mention that it means that document has already been returned previously (like it's documented in the previous 'if'). The last 'else' really looks like TSDC's out-of-order version, which makes me think whether PagingCollector can be viewed as a filter on top of TSDC (and possibly even TopFieldCollector)? So if a hit should be collected, it just calls super.collect? I realize though that a Collector is a hotspot and we want to minimize 'if' let alone method call statements as much as possible. But it just feels so strong that it should be a filter ... . And you wouldn't need to specifically handle in/out orderness ... and w/ the right design, it can also wrap a TFC or any other TDC implementation ... BTW, I've noticed that you don't track maxScore - is it assumed that the application stores it from the first round? If so I'd document it, because the application needs to know it should use TSDC the first round, and PagingCollector the second round. Also, PagingCollector offers a ctor which does not force the application to pass in a ScoreDoc. See my comment from above - it might be misleading, because if you use this collector right from the very first search, you lose the maxScore tracking. I also don't see why it should be allowed - if a dummy previousPassLowest ScoreDoc is used, collect() does a lot of unnecessary 'if's. I think this collector should be used only from the second round, and a single ctor which forces a ScoreDoc to be passed would make more sense. If the application wishes to shoot itself in the leg (performance-wise), it can pass a dummy SD itself.
          Hide
          Grant Ingersoll added a comment -

          I think in order to properly implement this, topDocs() needs to be non-final, otherwise there is some oddities in initing a PQ with more results than are available once paging. Updated patch shortly.

          Show
          Grant Ingersoll added a comment - I think in order to properly implement this, topDocs() needs to be non-final, otherwise there is some oddities in initing a PQ with more results than are available once paging. Updated patch shortly.
          Hide
          javi added a comment -

          disregard my previous comment... There was some refactoring in my codebase to get this in and NB_LUCENE_HITS_PER_BATCH was uninitialized...so far it is working sweetly, I will report when I finish my tests.

          thanks

          Show
          javi added a comment - disregard my previous comment... There was some refactoring in my codebase to get this in and NB_LUCENE_HITS_PER_BATCH was uninitialized...so far it is working sweetly, I will report when I finish my tests. thanks
          Hide
          javi added a comment -

          Kudos Aaron, this is cool for what I need.

          I just integrated in my project, upgraded to 3.0 just to get this in. But I am having an issue in my first test:

          java.lang.ArrayIndexOutOfBoundsException: 1
          at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:96)
          at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:67)
          at org.apache.lucene.search.PagingCollector.<init>(PagingCollector.java:43)
          at org.apache.lucene.search.PagingCollector.<init>(PagingCollector.java:39)
          at org.apache.lucene.search.IterablePaging$PagingIterator.search(IterablePaging.java:158)
          at org.apache.lucene.search.IterablePaging$PagingIterator.<init>(IterablePaging.java:151)
          at org.apache.lucene.search.IterablePaging.iterator(IterablePaging.java:140)
          at ...CombinedLuceneDBStep.proceed(CombinedLuceneDBStep.java:71)

          I use it like this:
          MultiSearcher ms = new MultiSearcher(indexes);
          TotalHitsRef totalHitsRef = new TotalHitsRef();
          ProgressRef progressRef = new ProgressRef();
          IterablePaging paging = new IterablePaging(ms, lucquery, NB_LUCENE_HITS_PER_BATCH);

          I have no clue where the issue lies, I am using MultiSearcher , and norms are disabled. Or maybe I screwed up something while upgrading to 3.0... I got the files as of feb 11th.

          Show
          javi added a comment - Kudos Aaron, this is cool for what I need. I just integrated in my project, upgraded to 3.0 just to get this in. But I am having an issue in my first test: java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:96) at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:67) at org.apache.lucene.search.PagingCollector.<init>(PagingCollector.java:43) at org.apache.lucene.search.PagingCollector.<init>(PagingCollector.java:39) at org.apache.lucene.search.IterablePaging$PagingIterator.search(IterablePaging.java:158) at org.apache.lucene.search.IterablePaging$PagingIterator.<init>(IterablePaging.java:151) at org.apache.lucene.search.IterablePaging.iterator(IterablePaging.java:140) at ...CombinedLuceneDBStep.proceed(CombinedLuceneDBStep.java:71) I use it like this: MultiSearcher ms = new MultiSearcher(indexes); TotalHitsRef totalHitsRef = new TotalHitsRef(); ProgressRef progressRef = new ProgressRef(); IterablePaging paging = new IterablePaging(ms, lucquery, NB_LUCENE_HITS_PER_BATCH); I have no clue where the issue lies, I am using MultiSearcher , and norms are disabled. Or maybe I screwed up something while upgrading to 3.0... I got the files as of feb 11th.
          Hide
          Aaron McCurry added a comment -

          Not sure if anyone has looked at this yet, but I just fixed a minor issue with the IterablePaging.java class. If you did not specify a gather amount, it wouldn't gather any hits.

          Show
          Aaron McCurry added a comment - Not sure if anyone has looked at this yet, but I just fixed a minor issue with the IterablePaging.java class. If you did not specify a gather amount, it wouldn't gather any hits.
          Hide
          Adam Heinz added a comment -

          Awesome, thanks! I'll schedule some time in the coming week to patch our dev installation and sic some QA guys on it.

          Show
          Adam Heinz added a comment - Awesome, thanks! I'll schedule some time in the coming week to patch our dev installation and sic some QA guys on it.
          Hide
          Aaron McCurry added a comment -

          This is a lucene 3.0 paging collector. This should allow for paging through any amount of results without exhausting the jvm's memory.

          Let me know if you have any questions or would like me to make any mods.

          http://www.nearinfinity.com/blogs/aaron_mccurry/making_lucene_hit_results_pagi.html

          Show
          Aaron McCurry added a comment - This is a lucene 3.0 paging collector. This should allow for paging through any amount of results without exhausting the jvm's memory. Let me know if you have any questions or would like me to make any mods. http://www.nearinfinity.com/blogs/aaron_mccurry/making_lucene_hit_results_pagi.html
          Hide
          Aaron McCurry added a comment -

          I will try to get my patch ready this weekend. Otherwise it will be first of the week.

          Show
          Aaron McCurry added a comment - I will try to get my patch ready this weekend. Otherwise it will be first of the week.

            People

            • Assignee:
              Robert Muir
              Reporter:
              Adam Heinz
            • Votes:
              5 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development