Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (7.0), 6.2
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      FiteredQuery had 3 ways of running conjunctions: leap-frog, query first and random-access filter. We still use leap-frog for conjunctions and we now have a better "query-first" strategy through two-phase iteration. However, we don't have any equivalent for the random-access filter strategy.

      1. LUCENE-7339.patch
        15 kB
        Adrien Grand

        Activity

        Hide
        jpountz Adrien Grand added a comment -

        Here is a patch: it looks at the iterators to intersect and applies all of them that are an instance of BitSetIterator in a random-access fashion. There is one exception: if the iterator that has the minimum cost in the list is a BitSetIterator, then this iterator will still be consumed using the regular nextDoc/advance DISI API. This makes the optimization safe since we still use the iterator with the least cost in order to "lead" the iteration.

        I also changed the query cache to use a FixedBitSet rather than a RoaringDocIdSet when the density is greater than 1% (which was the cut-off used by RandomAccessFilterStrategy). This should help both because FixedBitSet is faster on dense sets (1% is -2 on the x axis at http://people.apache.org/~jpountz/doc_id_sets6.html) and also because it will enable this random-access optimization more often.

        I had to hack a bit luceneutil in order to generate conjunctions with an iterator over a BitSet, which I did by creating a conjunction over a term query and a numeric range. Here is the patch I applied to luceneutil:

        diff --git a/src/main/perf/TaskParser.java b/src/main/perf/TaskParser.java
        index 8397b3a..2365159 100644
        --- a/src/main/perf/TaskParser.java
        +++ b/src/main/perf/TaskParser.java
        @@ -244,6 +244,29 @@ class TaskParser {
                 query = IntPoint.newRangeQuery(nrqFieldName, start, end);
                 sort = null;
                 group = null;
        +      } else if (text.startsWith("filtered_nrq//")) {
        +        // field start end
        +        final int spot3 = text.indexOf(' ');
        +        if (spot3 == -1) {
        +          throw new RuntimeException("failed to parse query=" + text);
        +        }
        +        final int spot4 = text.indexOf(' ', spot3+1);
        +        if (spot4 == -1) {
        +          throw new RuntimeException("failed to parse query=" + text);
        +        }
        +        final int spot5 = text.indexOf(' ', spot4+1);
        +        if (spot5 == -1) {
        +          throw new RuntimeException("failed to parse query=" + text);
        +        }
        +        final String nrqFieldName = text.substring("filtered_nrq//".length(), spot3);
        +        final int start = Integer.parseInt(text.substring(1+spot3, spot4));
        +        final int end = Integer.parseInt(text.substring(1+spot4, spot5));
        +        query = new BooleanQuery.Builder()
        +            .add(new TermQuery(new Term("body", text.substring(1+spot5))), Occur.MUST)
        +            .add(IntPoint.newRangeQuery(nrqFieldName, start, end), Occur.FILTER)
        +            .build();
        +        sort = null;
        +        group = null;
               } else if (text.startsWith("datetimesort//")) {
                 throw new IllegalArgumentException("use lastmodndvsort instead");
               } else if (text.startsWith("titlesort//")) {
        diff --git a/tasks/wikimedium.10M.nostopwords.tasks b/tasks/wikimedium.10M.nostopwords.tasks
        index 342070c..983361f 100644
        --- a/tasks/wikimedium.10M.nostopwords.tasks
        +++ b/tasks/wikimedium.10M.nostopwords.tasks
        @@ -13361,3 +13361,20 @@ OrNotHighLow: -do necessities # freq=511178 freq=1195
         OrHighNotLow: do -necessities # freq=511178 freq=1195
         OrNotHighLow: -had halfback # freq=1246743 freq=1205
         OrHighNotLow: had -halfback # freq=1246743 freq=1205
        +FilteredIntNRQ: filtered_nrq//timesecnum 6207 55832 ref
        +FilteredIntNRQ: filtered_nrq//timesecnum 53 85622 http
        +FilteredIntNRQ: filtered_nrq//timesecnum 2669 66142 from
        +FilteredIntNRQ: filtered_nrq//timesecnum 9936 85687 name
        +FilteredIntNRQ: filtered_nrq//timesecnum 23189 61377 title
        +FilteredIntNRQ: filtered_nrq//timesecnum 7624 69351 date
        +FilteredIntNRQ: filtered_nrq//timesecnum 15733 85583 which
        +FilteredIntNRQ: filtered_nrq//timesecnum 8791 69420 also
        +FilteredIntNRQ: filtered_nrq//timesecnum 6125 46693 first
        +FilteredIntNRQ: filtered_nrq//timesecnum 8006 80120 his
        +FilteredIntNRQ: filtered_nrq//timesecnum 11514 45063 cite
        +FilteredIntNRQ: filtered_nrq//timesecnum 6342 72089 he
        +FilteredIntNRQ: filtered_nrq//timesecnum 10670 66864 new
        +FilteredIntNRQ: filtered_nrq//timesecnum 2896 83864 1
        +FilteredIntNRQ: filtered_nrq//timesecnum 8947 64612 s
        +FilteredIntNRQ: filtered_nrq//timesecnum 2808 75217 2
        +FilteredIntNRQ: filtered_nrq//timesecnum 388 84762 one
        

        I got a ~6% speedup when testing it on wikimedium10m:

                            TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                         LowTerm      686.83      (7.6%)      677.34      (8.5%)   -1.4% ( -16% -   15%)
                HighSloppyPhrase       19.29      (4.3%)       19.03      (4.5%)   -1.3% (  -9% -    7%)
                         Prefix3      159.38      (5.0%)      157.87      (5.0%)   -0.9% ( -10% -    9%)
                    OrNotHighMed      191.55      (2.2%)      189.74      (2.3%)   -0.9% (  -5% -    3%)
                        Wildcard       99.07      (5.9%)       98.18      (5.9%)   -0.9% ( -11% -   11%)
                     MedSpanNear       38.50      (3.1%)       38.15      (3.2%)   -0.9% (  -7% -    5%)
                       MedPhrase       88.65      (2.6%)       87.88      (1.8%)   -0.9% (  -5% -    3%)
                 LowSloppyPhrase       59.35      (2.8%)       58.95      (2.9%)   -0.7% (  -6% -    5%)
                   OrHighNotHigh       49.32      (4.0%)       49.00      (4.2%)   -0.7% (  -8% -    7%)
                    OrHighNotMed       86.14      (4.8%)       85.60      (5.4%)   -0.6% ( -10% -   10%)
                    OrHighNotLow       50.89      (6.1%)       50.58      (5.7%)   -0.6% ( -11% -   11%)
                 MedSloppyPhrase       76.49      (2.0%)       76.05      (1.9%)   -0.6% (  -4% -    3%)
                     AndHighHigh      119.69      (1.2%)      119.07      (2.1%)   -0.5% (  -3% -    2%)
                    OrNotHighLow     1060.23      (3.2%)     1055.14      (3.9%)   -0.5% (  -7% -    6%)
                     LowSpanNear      229.73      (2.9%)      228.70      (3.0%)   -0.4% (  -6% -    5%)
                       LowPhrase      151.17      (2.2%)      150.49      (3.0%)   -0.4% (  -5% -    4%)
                      HighPhrase       43.96      (1.7%)       43.79      (2.3%)   -0.4% (  -4% -    3%)
                    HighSpanNear        2.23      (7.1%)        2.22      (7.4%)   -0.4% ( -13% -   15%)
                          IntNRQ       12.11      (8.9%)       12.07      (8.9%)   -0.3% ( -16% -   19%)
                   OrNotHighHigh       67.08      (3.9%)       66.92      (3.9%)   -0.2% (  -7% -    7%)
                         MedTerm      157.87      (5.6%)      157.54      (5.1%)   -0.2% ( -10% -   11%)
                        HighTerm       76.31      (5.8%)       76.16      (5.5%)   -0.2% ( -10% -   11%)
                         Respell       74.16      (2.3%)       74.01      (3.2%)   -0.2% (  -5% -    5%)
                      AndHighMed      144.89      (1.8%)      144.62      (1.5%)   -0.2% (  -3% -    3%)
                      OrHighHigh       31.12      (6.2%)       31.18      (5.7%)    0.2% ( -11% -   12%)
                       OrHighMed       38.73      (6.0%)       38.84      (5.4%)    0.3% ( -10% -   12%)
                       OrHighLow      102.78      (4.0%)      103.09      (4.0%)    0.3% (  -7% -    8%)
                          Fuzzy1       75.62     (16.4%)       76.64     (12.7%)    1.4% ( -23% -   36%)
                      AndHighLow      785.48      (4.2%)      796.18      (3.5%)    1.4% (  -6% -    9%)
                          Fuzzy2       45.30     (16.5%)       47.03     (19.4%)    3.8% ( -27% -   47%)
                  FilteredIntNRQ       10.57      (4.0%)       11.20      (4.4%)    5.9% (  -2% -   14%)
        
        Show
        jpountz Adrien Grand added a comment - Here is a patch: it looks at the iterators to intersect and applies all of them that are an instance of BitSetIterator in a random-access fashion. There is one exception: if the iterator that has the minimum cost in the list is a BitSetIterator, then this iterator will still be consumed using the regular nextDoc/advance DISI API. This makes the optimization safe since we still use the iterator with the least cost in order to "lead" the iteration. I also changed the query cache to use a FixedBitSet rather than a RoaringDocIdSet when the density is greater than 1% (which was the cut-off used by RandomAccessFilterStrategy). This should help both because FixedBitSet is faster on dense sets (1% is -2 on the x axis at http://people.apache.org/~jpountz/doc_id_sets6.html ) and also because it will enable this random-access optimization more often. I had to hack a bit luceneutil in order to generate conjunctions with an iterator over a BitSet, which I did by creating a conjunction over a term query and a numeric range. Here is the patch I applied to luceneutil: diff --git a/src/main/perf/TaskParser.java b/src/main/perf/TaskParser.java index 8397b3a..2365159 100644 --- a/src/main/perf/TaskParser.java +++ b/src/main/perf/TaskParser.java @@ -244,6 +244,29 @@ class TaskParser { query = IntPoint.newRangeQuery(nrqFieldName, start, end); sort = null; group = null; + } else if (text.startsWith("filtered_nrq//")) { + // field start end + final int spot3 = text.indexOf(' '); + if (spot3 == -1) { + throw new RuntimeException("failed to parse query=" + text); + } + final int spot4 = text.indexOf(' ', spot3+1); + if (spot4 == -1) { + throw new RuntimeException("failed to parse query=" + text); + } + final int spot5 = text.indexOf(' ', spot4+1); + if (spot5 == -1) { + throw new RuntimeException("failed to parse query=" + text); + } + final String nrqFieldName = text.substring("filtered_nrq//".length(), spot3); + final int start = Integer.parseInt(text.substring(1+spot3, spot4)); + final int end = Integer.parseInt(text.substring(1+spot4, spot5)); + query = new BooleanQuery.Builder() + .add(new TermQuery(new Term("body", text.substring(1+spot5))), Occur.MUST) + .add(IntPoint.newRangeQuery(nrqFieldName, start, end), Occur.FILTER) + .build(); + sort = null; + group = null; } else if (text.startsWith("datetimesort//")) { throw new IllegalArgumentException("use lastmodndvsort instead"); } else if (text.startsWith("titlesort//")) { diff --git a/tasks/wikimedium.10M.nostopwords.tasks b/tasks/wikimedium.10M.nostopwords.tasks index 342070c..983361f 100644 --- a/tasks/wikimedium.10M.nostopwords.tasks +++ b/tasks/wikimedium.10M.nostopwords.tasks @@ -13361,3 +13361,20 @@ OrNotHighLow: -do necessities # freq=511178 freq=1195 OrHighNotLow: do -necessities # freq=511178 freq=1195 OrNotHighLow: -had halfback # freq=1246743 freq=1205 OrHighNotLow: had -halfback # freq=1246743 freq=1205 +FilteredIntNRQ: filtered_nrq//timesecnum 6207 55832 ref +FilteredIntNRQ: filtered_nrq//timesecnum 53 85622 http +FilteredIntNRQ: filtered_nrq//timesecnum 2669 66142 from +FilteredIntNRQ: filtered_nrq//timesecnum 9936 85687 name +FilteredIntNRQ: filtered_nrq//timesecnum 23189 61377 title +FilteredIntNRQ: filtered_nrq//timesecnum 7624 69351 date +FilteredIntNRQ: filtered_nrq//timesecnum 15733 85583 which +FilteredIntNRQ: filtered_nrq//timesecnum 8791 69420 also +FilteredIntNRQ: filtered_nrq//timesecnum 6125 46693 first +FilteredIntNRQ: filtered_nrq//timesecnum 8006 80120 his +FilteredIntNRQ: filtered_nrq//timesecnum 11514 45063 cite +FilteredIntNRQ: filtered_nrq//timesecnum 6342 72089 he +FilteredIntNRQ: filtered_nrq//timesecnum 10670 66864 new +FilteredIntNRQ: filtered_nrq//timesecnum 2896 83864 1 +FilteredIntNRQ: filtered_nrq//timesecnum 8947 64612 s +FilteredIntNRQ: filtered_nrq//timesecnum 2808 75217 2 +FilteredIntNRQ: filtered_nrq//timesecnum 388 84762 one I got a ~6% speedup when testing it on wikimedium10m: TaskQPS baseline StdDev QPS patch StdDev Pct diff LowTerm 686.83 (7.6%) 677.34 (8.5%) -1.4% ( -16% - 15%) HighSloppyPhrase 19.29 (4.3%) 19.03 (4.5%) -1.3% ( -9% - 7%) Prefix3 159.38 (5.0%) 157.87 (5.0%) -0.9% ( -10% - 9%) OrNotHighMed 191.55 (2.2%) 189.74 (2.3%) -0.9% ( -5% - 3%) Wildcard 99.07 (5.9%) 98.18 (5.9%) -0.9% ( -11% - 11%) MedSpanNear 38.50 (3.1%) 38.15 (3.2%) -0.9% ( -7% - 5%) MedPhrase 88.65 (2.6%) 87.88 (1.8%) -0.9% ( -5% - 3%) LowSloppyPhrase 59.35 (2.8%) 58.95 (2.9%) -0.7% ( -6% - 5%) OrHighNotHigh 49.32 (4.0%) 49.00 (4.2%) -0.7% ( -8% - 7%) OrHighNotMed 86.14 (4.8%) 85.60 (5.4%) -0.6% ( -10% - 10%) OrHighNotLow 50.89 (6.1%) 50.58 (5.7%) -0.6% ( -11% - 11%) MedSloppyPhrase 76.49 (2.0%) 76.05 (1.9%) -0.6% ( -4% - 3%) AndHighHigh 119.69 (1.2%) 119.07 (2.1%) -0.5% ( -3% - 2%) OrNotHighLow 1060.23 (3.2%) 1055.14 (3.9%) -0.5% ( -7% - 6%) LowSpanNear 229.73 (2.9%) 228.70 (3.0%) -0.4% ( -6% - 5%) LowPhrase 151.17 (2.2%) 150.49 (3.0%) -0.4% ( -5% - 4%) HighPhrase 43.96 (1.7%) 43.79 (2.3%) -0.4% ( -4% - 3%) HighSpanNear 2.23 (7.1%) 2.22 (7.4%) -0.4% ( -13% - 15%) IntNRQ 12.11 (8.9%) 12.07 (8.9%) -0.3% ( -16% - 19%) OrNotHighHigh 67.08 (3.9%) 66.92 (3.9%) -0.2% ( -7% - 7%) MedTerm 157.87 (5.6%) 157.54 (5.1%) -0.2% ( -10% - 11%) HighTerm 76.31 (5.8%) 76.16 (5.5%) -0.2% ( -10% - 11%) Respell 74.16 (2.3%) 74.01 (3.2%) -0.2% ( -5% - 5%) AndHighMed 144.89 (1.8%) 144.62 (1.5%) -0.2% ( -3% - 3%) OrHighHigh 31.12 (6.2%) 31.18 (5.7%) 0.2% ( -11% - 12%) OrHighMed 38.73 (6.0%) 38.84 (5.4%) 0.3% ( -10% - 12%) OrHighLow 102.78 (4.0%) 103.09 (4.0%) 0.3% ( -7% - 8%) Fuzzy1 75.62 (16.4%) 76.64 (12.7%) 1.4% ( -23% - 36%) AndHighLow 785.48 (4.2%) 796.18 (3.5%) 1.4% ( -6% - 9%) Fuzzy2 45.30 (16.5%) 47.03 (19.4%) 3.8% ( -27% - 47%) FilteredIntNRQ 10.57 (4.0%) 11.20 (4.4%) 5.9% ( -2% - 14%)
        Hide
        thetaphi Uwe Schindler added a comment -

        Thanks for reimplementing! I am trying to understand it, will report later!

        } else if (disi.getClass() == BitSetConjunctionDISI.class) {
        

        Why not instanceof?

        Show
        thetaphi Uwe Schindler added a comment - Thanks for reimplementing! I am trying to understand it, will report later! } else if (disi.getClass() == BitSetConjunctionDISI.class) { Why not instanceof ?
        Hide
        jpountz Adrien Grand added a comment -

        Thanks for looking Uwe! I guess the class comparison is mainly me being paranoid. In that case it does not matter since BitSetConjunctionDISI is final, but otherwise instanceof might also match sub classes. I can change it if you like instanceof better.

        Show
        jpountz Adrien Grand added a comment - Thanks for looking Uwe! I guess the class comparison is mainly me being paranoid. In that case it does not matter since BitSetConjunctionDISI is final, but otherwise instanceof might also match sub classes. I can change it if you like instanceof better.
        Hide
        jpountz Adrien Grand added a comment -

        Uwe Schindler Have you had a chance to give it a deeper look?

        Show
        jpountz Adrien Grand added a comment - Uwe Schindler Have you had a chance to give it a deeper look?
        Hide
        thetaphi Uwe Schindler added a comment - - edited

        Hi,
        yes. I checked the createConjunction code, looks fine to me.

        I have just a minor "simplification". Finding the minCost can be done with Java 8 streams very easily:

        Instead of:

        long minCost = Long.MAX_VALUE;
        for (DocIdSetIterator iterator : allIterators) {
          minCost = Math.min(minCost, iterator.cost());
        }
        

        Do in a one-liner with method-references and streams:

        long minCost = allIterators.stream().mapToLong(DocIdSetIterator::cost).min().getAsLong();
        

        Test with the anonymous class looks fine!

        Show
        thetaphi Uwe Schindler added a comment - - edited Hi, yes. I checked the createConjunction code, looks fine to me. I have just a minor "simplification". Finding the minCost can be done with Java 8 streams very easily: Instead of: long minCost = Long .MAX_VALUE; for (DocIdSetIterator iterator : allIterators) { minCost = Math .min(minCost, iterator.cost()); } Do in a one-liner with method-references and streams: long minCost = allIterators.stream().mapToLong(DocIdSetIterator::cost).min().getAsLong(); Test with the anonymous class looks fine!
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 60dcef6c61b0a72f4a7bb6e9b27a0073464f53c5 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=60dcef6 ]

        LUCENE-7339: Specialize conjunction with bit sets.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 60dcef6c61b0a72f4a7bb6e9b27a0073464f53c5 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=60dcef6 ] LUCENE-7339 : Specialize conjunction with bit sets.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 7afa3333c654b6423563e6dd1cd5478812924148 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7afa333 ]

        LUCENE-7339: Specialize conjunction with bit sets.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 7afa3333c654b6423563e6dd1cd5478812924148 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7afa333 ] LUCENE-7339 : Specialize conjunction with bit sets.
        Hide
        jpountz Adrien Grand added a comment -

        Thanks Uwe, I applied the simplification you suggested.

        Show
        jpountz Adrien Grand added a comment - Thanks Uwe, I applied the simplification you suggested.
        Hide
        thetaphi Uwe Schindler added a comment -

        All fine Thanks for committing!

        Show
        thetaphi Uwe Schindler added a comment - All fine Thanks for committing!
        Hide
        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.

        Show
        mikemccand Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.

          People

          • Assignee:
            jpountz Adrien Grand
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development