Lucene - Core
  1. Lucene - Core
  2. LUCENE-4410

Make FilteredQuery more flexible with regards to how filters are applied

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA
    • Fix Version/s: 4.1, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently FilteredQuery uses either the "old" lucene 3 leap frog approach or pushes the filter down together with accepted docs. Yet there might be more strategies required to fit common usecases like geo-filtering where a rather costly function is applied to each document. Using leap frog this might result in a very slow query if the filter is advanced since it might have linear running time to find the next valid document. We should be more flexible with regards to those usecases and make it possible to either tell FQ what to do or plug in a strategy that applied a filter in a different way.

      The current FQ impl also uses an heuristic to decide if RA or LeapFrog should be used. This is really an implementation detail of the strategy and not of FQ and should be moved out.

      1. LUCENE-4410.patch
        38 kB
        Simon Willnauer
      2. LUCENE-4410.patch
        38 kB
        Simon Willnauer
      3. LUCENE-4410.patch
        38 kB
        Simon Willnauer
      4. LUCENE-4410.patch
        28 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          This patch introduces a FilterStrategy similar to MTQ#RewriteMethod that wraps a given scorer & DocIDSet into a "Filtered Scorer". By default FilteredQuery uses the RandomAccessFilterStrategy that falls back to leap frog just like FQ did before. In contrast to the current FQ #useRandomAccess is now an impl detail of the RAFilterStrategy and might be removed with a different implementation that might rely on Scorer#cost() or something like that.

          I also added a DocFirstFilterStrategy that applies filters only iff the delegate scorer matched. This seems more consistent with what other queries do (ie. MTQ) and provides more flexibility to the user

          Show
          Simon Willnauer added a comment - This patch introduces a FilterStrategy similar to MTQ#RewriteMethod that wraps a given scorer & DocIDSet into a "Filtered Scorer". By default FilteredQuery uses the RandomAccessFilterStrategy that falls back to leap frog just like FQ did before. In contrast to the current FQ #useRandomAccess is now an impl detail of the RAFilterStrategy and might be removed with a different implementation that might rely on Scorer#cost() or something like that. I also added a DocFirstFilterStrategy that applies filters only iff the delegate scorer matched. This seems more consistent with what other queries do (ie. MTQ) and provides more flexibility to the user
          Hide
          Michael McCandless added a comment -

          This sounds great! I'll look more at the patch ...

          But surely this is not a blocker issue?

          We (well, Robert: thanks!) are about to cut the final 4.0 release ... we shouldn't stuff features in at the last minute.

          Show
          Michael McCandless added a comment - This sounds great! I'll look more at the patch ... But surely this is not a blocker issue? We (well, Robert: thanks!) are about to cut the final 4.0 release ... we shouldn't stuff features in at the last minute.
          Hide
          Simon Willnauer added a comment -

          s/blocker/major

          Show
          Simon Willnauer added a comment - s/blocker/major
          Hide
          Simon Willnauer added a comment -

          We (well, Robert: thanks!) are about to cut the final 4.0 release ... we shouldn't stuff features in at the last minute.

          I marked this as a blocker since it really limits to a certain type of filters. I don't think this is a last minute feature really. I'd be totally happy to have only the structural refactoring in Lucene 4.0 and add the DocFirstStrategy later if that is consider a feature though.

          Show
          Simon Willnauer added a comment - We (well, Robert: thanks!) are about to cut the final 4.0 release ... we shouldn't stuff features in at the last minute. I marked this as a blocker since it really limits to a certain type of filters. I don't think this is a last minute feature really. I'd be totally happy to have only the structural refactoring in Lucene 4.0 and add the DocFirstStrategy later if that is consider a feature though.
          Hide
          Robert Muir added a comment -

          I don't agree with blocker either. Should it block 3.6.2 as well? 4.0 is significantly more flexible than 3.x was.

          3.x only had one way to execute a filter: leapfrog.
          4.x has three ways: leapfrog, bits, and auto. the default is auto.

          is.search(new FilteredQuery(q,f) {
            protected boolean useRandomAccess(Bits bits, int firstFilterDoc) {
              return false; // always leapfrog: act like 3.x
            }
          });
          ...
          is.search(new FilteredQuery(q,f) {
            protected boolean useRandomAccess(Bits bits, int firstFilterDoc) {
              return true; // never leapfrog if Bits are available
            }
          });
          

          Separately (unrelated to release management) I like this idea and I think we should do it.

          But as i said on the mailing list, i think we should be tackling bugs, javadocs, and tests and approaching stability.
          it makes me nervous to add more features to our filter execution right before release.

          Show
          Robert Muir added a comment - I don't agree with blocker either. Should it block 3.6.2 as well? 4.0 is significantly more flexible than 3.x was. 3.x only had one way to execute a filter: leapfrog. 4.x has three ways: leapfrog, bits, and auto. the default is auto. is.search( new FilteredQuery(q,f) { protected boolean useRandomAccess(Bits bits, int firstFilterDoc) { return false ; // always leapfrog: act like 3.x } }); ... is.search( new FilteredQuery(q,f) { protected boolean useRandomAccess(Bits bits, int firstFilterDoc) { return true ; // never leapfrog if Bits are available } }); Separately (unrelated to release management) I like this idea and I think we should do it. But as i said on the mailing list, i think we should be tackling bugs, javadocs, and tests and approaching stability. it makes me nervous to add more features to our filter execution right before release.
          Hide
          Uwe Schindler added a comment - - edited

          Hi,
          I am also against rushing this in with 4.0. There is no slowdown in comparison to Lucene 3.6. My problems are also in the patch:

          • I hate the crazy not-really-random access Bits impl in the DocFirstStrategy! It is definitely not random access, so it violates the contract. In this case the standard LeapFrog approach should be used (if not random access DocIdSet).
          • I dont really see an improvement. The Bits interface is not allowed to throw IOException, so the filter should only return a Bits interface if its a very fast random access implementation. In all other cases the filter must not suport Bits and then LeapFrog has to be used.
          • I would only see an improvement if the method Filter.getDocIdSet() is only called after the scorer hit the first document - but this does not work with random access. This was also not done in 3.6, so we should not rush.
          • The ctor strategy param should never be NULL. The ctor without strategy should pass the explicit one down to this(..., Strategy)
          Show
          Uwe Schindler added a comment - - edited Hi, I am also against rushing this in with 4.0. There is no slowdown in comparison to Lucene 3.6. My problems are also in the patch: I hate the crazy not-really-random access Bits impl in the DocFirstStrategy! It is definitely not random access, so it violates the contract. In this case the standard LeapFrog approach should be used (if not random access DocIdSet). I dont really see an improvement. The Bits interface is not allowed to throw IOException, so the filter should only return a Bits interface if its a very fast random access implementation. In all other cases the filter must not suport Bits and then LeapFrog has to be used. I would only see an improvement if the method Filter.getDocIdSet() is only called after the scorer hit the first document - but this does not work with random access. This was also not done in 3.6, so we should not rush. The ctor strategy param should never be NULL. The ctor without strategy should pass the explicit one down to this(..., Strategy)
          Hide
          Uwe Schindler added a comment -

          In addition, stuff like this can also be added in later 4.x versions, as it does not change API of FilteredQuery, it just adds an additional ctor param

          Show
          Uwe Schindler added a comment - In addition, stuff like this can also be added in later 4.x versions, as it does not change API of FilteredQuery, it just adds an additional ctor param
          Hide
          Uwe Schindler added a comment -

          I would aggree to this patch if the following is fixed:

          • remove DocFirst* for now. See above for reasons.
          • only copy-paste current code in the new API
          • make the ctor with strategy not accept null, require it to pass an explicit strategy. The ctor without strategy should pass the explicit one down to this(..., Strategy)
          Show
          Uwe Schindler added a comment - I would aggree to this patch if the following is fixed: remove DocFirst* for now. See above for reasons. only copy-paste current code in the new API make the ctor with strategy not accept null, require it to pass an explicit strategy. The ctor without strategy should pass the explicit one down to this(..., Strategy)
          Hide
          Simon Willnauer added a comment -

          I would aggree to this patch if the following is fixed:

          I think that is a fair game. I think we should keep the DocFirst for trunk and if everybody agrees I merge the changes into 4.0 excluding the additional strategy. I already just copy pasted the current stuff in the the other two strategies. I can even remove the leap frog strategy in 4.x and keep in on trunk so I can use it to fallback to it in the DocFirst. Objections?

          Show
          Simon Willnauer added a comment - I would aggree to this patch if the following is fixed: I think that is a fair game. I think we should keep the DocFirst for trunk and if everybody agrees I merge the changes into 4.0 excluding the additional strategy. I already just copy pasted the current stuff in the the other two strategies. I can even remove the leap frog strategy in 4.x and keep in on trunk so I can use it to fallback to it in the DocFirst. Objections?
          Hide
          Uwe Schindler added a comment -

          Hi,
          I have one more comment for the DocFirst strategy:
          The idea is good because it lets the query drive the collector and we only look up docs, the query hi (using the random access filter). This is sometimes better than passing it down as acceptdocs, because it would slowdown if the Bits interface is expensive and every query subclause must reevaluate the bits.get() method.
          The problem I had with trhe patch is the crazy Bits implementation for the DocFirstStrategy, which had exactly this problem. Also it was not following the random access pattern, because it allowed the Bits.get() calls only in order. I can easily write a BooleanScorer1-like query that violates this (because a query with more than one sub-clause can easily call Bits.get() out of order for each sub-clause).
          The DocFirstStrategy wants the query drive the collection, so the non-bits approach should either use LeapFrog (which may be expensive if the filter has ineffective nextDoc()) or it should also implemen DocFirst in order. I would rename that strategy to QueryFirstStrategy and make 2 scorers for it:

          • a random access one calling Bits.get() for every hit of the query
          • a sequential one that calls nextDoc() only on the Query, never on the filter. The filter is only advanced to the current query doc. By this the filter oly scans through its docs very seldom (when there is no hit after advance).
          Show
          Uwe Schindler added a comment - Hi, I have one more comment for the DocFirst strategy: The idea is good because it lets the query drive the collector and we only look up docs, the query hi (using the random access filter). This is sometimes better than passing it down as acceptdocs, because it would slowdown if the Bits interface is expensive and every query subclause must reevaluate the bits.get() method. The problem I had with trhe patch is the crazy Bits implementation for the DocFirstStrategy, which had exactly this problem. Also it was not following the random access pattern, because it allowed the Bits.get() calls only in order. I can easily write a BooleanScorer1-like query that violates this (because a query with more than one sub-clause can easily call Bits.get() out of order for each sub-clause). The DocFirstStrategy wants the query drive the collection, so the non-bits approach should either use LeapFrog (which may be expensive if the filter has ineffective nextDoc()) or it should also implemen DocFirst in order. I would rename that strategy to QueryFirstStrategy and make 2 scorers for it: a random access one calling Bits.get() for every hit of the query a sequential one that calls nextDoc() only on the Query, never on the filter. The filter is only advanced to the current query doc. By this the filter oly scans through its docs very seldom (when there is no hit after advance).
          Hide
          Simon Willnauer added a comment -

          here is a new patch that uses a QueryFirstLeapFrog if we can't pull bits in the DocFirstStrategy. I generalized the LeapFrog scorer such that it can do query first, filter first and "filter already advanced" I also fixed the parameters in the ctor and added both QUERY_FIRST and FILTER_FIRST strategies to FilterQuery. I hooked that up with Random testing and all tests pass

          Show
          Simon Willnauer added a comment - here is a new patch that uses a QueryFirstLeapFrog if we can't pull bits in the DocFirstStrategy. I generalized the LeapFrog scorer such that it can do query first, filter first and "filter already advanced" I also fixed the parameters in the ctor and added both QUERY_FIRST and FILTER_FIRST strategies to FilterQuery. I hooked that up with Random testing and all tests pass
          Hide
          Simon Willnauer added a comment -

          For the given patch I'd commit that to trunk and backport the FilterStrategy structure without the enhancements to 4.0. That way we only have the API change in 4.0 and no new feature but that would allow us to add all the other stuff in this patch to go into 4.1 without breaking anything.

          Show
          Simon Willnauer added a comment - For the given patch I'd commit that to trunk and backport the FilterStrategy structure without the enhancements to 4.0. That way we only have the API change in 4.0 and no new feature but that would allow us to add all the other stuff in this patch to go into 4.1 without breaking anything.
          Hide
          Uwe Schindler added a comment -

          Hi Simon, now I am happy. This is much better than before. I think this is a real improvement, I have no problem with getting that in. Maybe we commit it to trunk first and let it bake over the weekend.

          PrimaryAdvancedLeapFrogScorer is a good workaround, it take some time until I understaood that this scorer simply "already knows" that it is already placed on the filter's first doc.

          Show
          Uwe Schindler added a comment - Hi Simon, now I am happy. This is much better than before. I think this is a real improvement, I have no problem with getting that in. Maybe we commit it to trunk first and let it bake over the weekend. PrimaryAdvancedLeapFrogScorer is a good workaround, it take some time until I understaood that this scorer simply "already knows" that it is already placed on the filter's first doc.
          Hide
          Uwe Schindler added a comment -

          The special PrimaryAdvancedLeapFrogScorer is final, once it is removed, we can make the original one final. For now I would make all methods except the primaryNext() one final (to help Hotspot).

          Show
          Uwe Schindler added a comment - The special PrimaryAdvancedLeapFrogScorer is final, once it is removed, we can make the original one final. For now I would make all methods except the primaryNext() one final (to help Hotspot).
          Hide
          Simon Willnauer added a comment -

          next iteration, fixing some javadoc issues and adding a changes entry. I also made all methods that should not be subclassed final on LeapFrogScorer so hotspot doesn´t mess around with it. I think its ready. If nobody objects I will commit this to trunk tomorrow morning

          Show
          Simon Willnauer added a comment - next iteration, fixing some javadoc issues and adding a changes entry. I also made all methods that should not be subclassed final on LeapFrogScorer so hotspot doesn´t mess around with it. I think its ready. If nobody objects I will commit this to trunk tomorrow morning
          Hide
          Michael McCandless added a comment -

          I think we should target 4.1 for this? Ie commit to trunk, let it back, ship 4.0, backport.

          Show
          Michael McCandless added a comment - I think we should target 4.1 for this? Ie commit to trunk, let it back, ship 4.0, backport.
          Hide
          Uwe Schindler added a comment -

          We can only do this, if we add "experimental" to the random access detetion method... Otherwise we are in backwards-compatibility hell in 4.1

          I checked the code, it's identical to the one before, just class hierarchy changed... We can maybe only remove the new QueryFirstStrategy, but the other one is 100% identical to current 4.0 code. I see no problem with it going in.

          Show
          Uwe Schindler added a comment - We can only do this, if we add "experimental" to the random access detetion method... Otherwise we are in backwards-compatibility hell in 4.1 I checked the code, it's identical to the one before, just class hierarchy changed... We can maybe only remove the new QueryFirstStrategy, but the other one is 100% identical to current 4.0 code. I see no problem with it going in.
          Hide
          Uwe Schindler added a comment -

          Simon: I dont see the final methods in latest patch. Are you sure you uploaded the right one?

          Show
          Uwe Schindler added a comment - Simon: I dont see the final methods in latest patch. Are you sure you uploaded the right one?
          Hide
          Michael McCandless added a comment -

          We can only do this, if we add "experimental" to the random access detetion method...

          +1

          I just don't think we should be making such biggish changes just before releasing 4.0...

          Show
          Michael McCandless added a comment - We can only do this, if we add "experimental" to the random access detetion method... +1 I just don't think we should be making such biggish changes just before releasing 4.0...
          Hide
          Uwe Schindler added a comment -

          It is not biggish It is same code as before I don't care, but the RA autodetection was always horrible to me, now its hidden behind something else!

          Show
          Uwe Schindler added a comment - It is not biggish It is same code as before I don't care, but the RA autodetection was always horrible to me, now its hidden behind something else!
          Hide
          Simon Willnauer added a comment -

          I attached the wrong patch yesterday - thanks uwe for looking at it

          Show
          Simon Willnauer added a comment - I attached the wrong patch yesterday - thanks uwe for looking at it
          Hide
          Simon Willnauer added a comment -

          I committed this to trunk and added @lucene.experimental to FilterQuery#useRandomAccess on branch4x. I still think we should port the structure to 4x before we release but lets bake it in for a day or two and see how it goes. We are safe to change this in 4.1 now

          Show
          Simon Willnauer added a comment - I committed this to trunk and added @lucene.experimental to FilterQuery#useRandomAccess on branch4x. I still think we should port the structure to 4x before we release but lets bake it in for a day or two and see how it goes. We are safe to change this in 4.1 now
          Hide
          Simon Willnauer added a comment -

          I changed the fix version to 4.1 - I will keep this issue open until this is backported.

          Show
          Simon Willnauer added a comment - I changed the fix version to 4.1 - I will keep this issue open until this is backported.
          Hide
          Simon Willnauer added a comment -

          committed to 4.x in revision 1389462.

          Show
          Simon Willnauer added a comment - committed to 4.x in revision 1389462.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1389462

          LUCENE-4410: Make FilteredQuery more flexible with regards to how filters are applied

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1389462 LUCENE-4410 : Make FilteredQuery more flexible with regards to how filters are applied
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1388393

          LUCENE-4410: made FilteredQuery#useRandomAccess experimental this will change in 4.1

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1388393 LUCENE-4410 : made FilteredQuery#useRandomAccess experimental this will change in 4.1
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development