Lucene - Core
  1. Lucene - Core
  2. LUCENE-5418

Don't use .advance on costly (e.g. distance range facets) filters

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If you use a distance filter today (see http://blog.mikemccandless.com/2014/01/geospatial-distance-faceting-using.html ), then drill down on one of those ranges, under the hood Lucene is using .advance on the Filter, which is very costly because we end up computing distance on (possibly many) hits that don't match the query.

      It's better performance to find the hits matching the Query first, and then check the filter.

      FilteredQuery can already do this today, when you use its QUERY_FIRST_FILTER_STRATEGY. This essentially accomplishes the same thing as Solr's "post filters" (I think?) but with a far simpler/better/less code approach.

      E.g., I believe ElasticSearch uses this API when it applies costly filters.

      Longish term, I think Query/Filter ought to know itself that it's expensive, and cases where such a Query/Filter is MUST'd onto a BooleanQuery (e.g. ConstantScoreQuery), or the Filter is a clause in BooleanFilter, or it's passed to IndexSearcher.search, we should also be "smart" here and not call .advance on such clauses. But that'd be a biggish change ... so for today the "workaround" is the user must carefully construct the FilteredQuery themselves.

      In the mean time, as another workaround, I want to fix DrillSideways so that when you drill down on such filters it doesn't use .advance; this should give a good speedup for the "normal path" API usage with a costly filter.

      I'm iterating on the lucene server branch (LUCENE-5376) but once it's working I plan to merge this back to trunk / 4.7.

      1. LUCENE-5418.patch
        98 kB
        Michael McCandless
      2. LUCENE-5418.patch
        74 kB
        Michael McCandless

        Activity

        Hide
        ASF subversion and git services added a comment -

        Commit 1562127 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1562127 ]

        LUCENE-5418: fix drill-down & drill-sideways to use Bits interface when applying a Filter

        Show
        ASF subversion and git services added a comment - Commit 1562127 from Michael McCandless in branch 'dev/branches/lucene5376' [ https://svn.apache.org/r1562127 ] LUCENE-5418 : fix drill-down & drill-sideways to use Bits interface when applying a Filter
        Hide
        Michael McCandless added a comment -

        Patch, I think it's ready.

        I simplified DrillSideways by removing the collector method and cutting over to DISI (not DocsEnum), and then using Bits.get from a Filter when it supports random access. I also cutover DrillDownQuery to FilteredQuery.

        Show
        Michael McCandless added a comment - Patch, I think it's ready. I simplified DrillSideways by removing the collector method and cutting over to DISI (not DocsEnum), and then using Bits.get from a Filter when it supports random access. I also cutover DrillDownQuery to FilteredQuery.
        Hide
        Robert Muir added a comment -

        The workaround seems pretty complicated.

        I don't think we should have slow advance() implementations that run in linear time, thats the real issue, it would be better if advance() was an optional operation [add boolean hasAdvance() method or something], and they just threw UOE instead.

        Show
        Robert Muir added a comment - The workaround seems pretty complicated. I don't think we should have slow advance() implementations that run in linear time, thats the real issue, it would be better if advance() was an optional operation [add boolean hasAdvance() method or something] , and they just threw UOE instead.
        Hide
        Michael McCandless added a comment -

        it would be better if advance() was an optional operation [add boolean hasAdvance() method or something], and they just threw UOE instead.

        +1

        And, fix our queries to handle this properly; e.g. BS2 should detect when a CSQ(Filter) cannot do advance, and (somehow) pull its Bits instead.

        At one point in my patch I was doing this w/ the slow Filters ... but then I was foiled by FilteredQuery.rewrite (it detects when the base query is MatchAll and switches back to the iterator) ... maybe we can remove that rewrite "smartness" and then I'll go back to throwing UOE.

        Show
        Michael McCandless added a comment - it would be better if advance() was an optional operation [add boolean hasAdvance() method or something] , and they just threw UOE instead. +1 And, fix our queries to handle this properly; e.g. BS2 should detect when a CSQ(Filter) cannot do advance, and (somehow) pull its Bits instead. At one point in my patch I was doing this w/ the slow Filters ... but then I was foiled by FilteredQuery.rewrite (it detects when the base query is MatchAll and switches back to the iterator) ... maybe we can remove that rewrite "smartness" and then I'll go back to throwing UOE.
        Hide
        Robert Muir added a comment -

        The same applies to next() really: so the whole thing is that for such filters, we should only be using Bits.

        Show
        Robert Muir added a comment - The same applies to next() really: so the whole thing is that for such filters, we should only be using Bits.
        Hide
        ASF subversion and git services added a comment -

        Commit 1562508 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1562508 ]

        LUCENE-5418: throw UOE from Filter.docIdSet again

        Show
        ASF subversion and git services added a comment - Commit 1562508 from Michael McCandless in branch 'dev/branches/lucene5376' [ https://svn.apache.org/r1562508 ] LUCENE-5418 : throw UOE from Filter.docIdSet again
        Hide
        David Smiley added a comment -

        How ironic that I was contemplating this very same issue yesterday (shared on IRC #lucene-dev) as I work on LUCENE-5408 and now I see you guys were just thinking about it. Rob's right; the problem isn't just advance(), it's next() too.

        There may be a place to share some code that Mike is committing here in his facet module with a static utility class I coded yesterday in LUCENE-5408 (not yet posted). It's a BitsDocIdSet and it's roughly similar to Mike's SlowBitsDocIdSetIterator:

        
          /** Utility class that wraps a {@link Bits} with a {@link DocIdSet}. */
          private static class BitsDocIdSet extends DocIdSet {
            final Bits bits;//not null
        
            public BitsDocIdSet(Bits bits) {
              if (bits == null)
                throw new NullPointerException("bits arg should be non-null");
              this.bits = bits;
            }
        
            @Override
            public DocIdSetIterator iterator() throws IOException {
              return new DocIdSetIterator() {
                final Bits bits = BitsDocIdSet.this.bits;//copy reference to reduce outer class access
                int docId = -1;
        
                @Override
                public int docID() {
                  return docId;
                }
        
                @Override
                public int nextDoc() throws IOException {
                  return advance(docId + 1);
                }
        
                @Override
                public int advance(int target) throws IOException {
                  for (docId = target; docId < bits.length(); docId++) {
                    if (bits.get(docId))
                      return docId;
                  }
                  return NO_MORE_DOCS;
                }
        
                @Override
                public long cost() {
                  return bits.length();
                }
              };
            }
        
            @Override
            public Bits bits() throws IOException {
              return bits;//won't be null
            }
        
            //we don't override isCacheable because we want the default of false
          }//class BitsDocIdSet
        

        So Mike; you've got just the DISI portion, and you're also incorporating acceptDocs. For me I elected to have acceptDocs be pre-incorporated into the Bits I pass through. I'll post my intermediate progress on LUCENE-5408. So any way; how about we have something in the "utils" package to share?

        Show
        David Smiley added a comment - How ironic that I was contemplating this very same issue yesterday (shared on IRC #lucene-dev) as I work on LUCENE-5408 and now I see you guys were just thinking about it. Rob's right; the problem isn't just advance(), it's next() too. There may be a place to share some code that Mike is committing here in his facet module with a static utility class I coded yesterday in LUCENE-5408 (not yet posted). It's a BitsDocIdSet and it's roughly similar to Mike's SlowBitsDocIdSetIterator: /** Utility class that wraps a {@link Bits} with a {@link DocIdSet}. */ private static class BitsDocIdSet extends DocIdSet { final Bits bits; //not null public BitsDocIdSet(Bits bits) { if (bits == null ) throw new NullPointerException( "bits arg should be non- null " ); this .bits = bits; } @Override public DocIdSetIterator iterator() throws IOException { return new DocIdSetIterator() { final Bits bits = BitsDocIdSet. this .bits; //copy reference to reduce outer class access int docId = -1; @Override public int docID() { return docId; } @Override public int nextDoc() throws IOException { return advance(docId + 1); } @Override public int advance( int target) throws IOException { for (docId = target; docId < bits.length(); docId++) { if (bits.get(docId)) return docId; } return NO_MORE_DOCS; } @Override public long cost() { return bits.length(); } }; } @Override public Bits bits() throws IOException { return bits; //won't be null } //we don't override isCacheable because we want the default of false } //class BitsDocIdSet So Mike; you've got just the DISI portion, and you're also incorporating acceptDocs. For me I elected to have acceptDocs be pre-incorporated into the Bits I pass through. I'll post my intermediate progress on LUCENE-5408 . So any way; how about we have something in the "utils" package to share?
        Hide
        Michael McCandless added a comment -

        Actually, I've removed the SlowBitsDocIdSetIterator (need to post a patch again soon...), because it's too trappy. I think it's better if the user gets an exception here than just silently run super slowly, and at least for this issue there are always ways to run the Filter "quickly" (use DrillSideways or DrillDownQuery, or create FilteredQuery directly).

        Show
        Michael McCandless added a comment - Actually, I've removed the SlowBitsDocIdSetIterator (need to post a patch again soon...), because it's too trappy. I think it's better if the user gets an exception here than just silently run super slowly, and at least for this issue there are always ways to run the Filter "quickly" (use DrillSideways or DrillDownQuery, or create FilteredQuery directly).
        Hide
        ASF subversion and git services added a comment -

        Commit 1563195 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1563195 ]

        LUCENE-5418: add optional fast-match Filter to range faceting / drill down / drill sideways

        Show
        ASF subversion and git services added a comment - Commit 1563195 from Michael McCandless in branch 'dev/branches/lucene5376' [ https://svn.apache.org/r1563195 ] LUCENE-5418 : add optional fast-match Filter to range faceting / drill down / drill sideways
        Hide
        Michael McCandless added a comment -

        New patch, throwing UOE from DocIdSet.iterator() for the Filter returned by Range.getFilter(). I like this approach: it's safer for the user so they don't accidentally apply a super slow filter.

        I also added "fastmatch" Filter to range faceting, so if e.g. your ValueSource is costly to compute, you can pass a fastmatch filter to rule out docs that should not even be considered.

        Show
        Michael McCandless added a comment - New patch, throwing UOE from DocIdSet.iterator() for the Filter returned by Range.getFilter(). I like this approach: it's safer for the user so they don't accidentally apply a super slow filter. I also added "fastmatch" Filter to range faceting, so if e.g. your ValueSource is costly to compute, you can pass a fastmatch filter to rule out docs that should not even be considered.
        Hide
        ASF subversion and git services added a comment -

        Commit 1564765 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1564765 ]

        LUCENE-5418: fix typo

        Show
        ASF subversion and git services added a comment - Commit 1564765 from Michael McCandless in branch 'dev/branches/lucene5376' [ https://svn.apache.org/r1564765 ] LUCENE-5418 : fix typo
        Hide
        Michael McCandless added a comment -

        I think the last patch is ready except for a small typo ... I'll commit soon.

        Show
        Michael McCandless added a comment - I think the last patch is ready except for a small typo ... I'll commit soon.
        Hide
        ASF subversion and git services added a comment -

        Commit 1565387 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1565387 ]

        LUCENE-5418: faster drill-down/sideways on costly filters

        Show
        ASF subversion and git services added a comment - Commit 1565387 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1565387 ] LUCENE-5418 : faster drill-down/sideways on costly filters
        Hide
        ASF subversion and git services added a comment -

        Commit 1565391 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1565391 ]

        LUCENE-5418: faster drill-down/sideways on costly filters

        Show
        ASF subversion and git services added a comment - Commit 1565391 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1565391 ] LUCENE-5418 : faster drill-down/sideways on costly filters
        Hide
        ASF subversion and git services added a comment -

        Commit 1665972 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1665972 ]

        LUCENE-5418: Remove incorrect javadocs relict

        Show
        ASF subversion and git services added a comment - Commit 1665972 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1665972 ] LUCENE-5418 : Remove incorrect javadocs relict
        Hide
        ASF subversion and git services added a comment -

        Commit 1665973 from Uwe Schindler in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1665973 ]

        Merged revision(s) 1665972 from lucene/dev/trunk:
        LUCENE-5418: Remove incorrect javadocs relic

        Show
        ASF subversion and git services added a comment - Commit 1665973 from Uwe Schindler in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1665973 ] Merged revision(s) 1665972 from lucene/dev/trunk: LUCENE-5418 : Remove incorrect javadocs relic

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development