Lucene - Core
  1. Lucene - Core
  2. LUCENE-6553

Simplify how we handle deleted docs in read APIs

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today, all scorers and postings formats need to be able to handle deleted documents.

      I suspect that the reason is that we want to be able to make sure to not perform costly operations on documents that are deleted. For instance if you run a phrase query, reading positions on a document which is deleted is useless. I suspect this is also a source of inefficiencies since in some cases we apply deleted documents several times: for instance conjunctions apply deleted docs to every sub scorer.

      However, with the new two-phase iteration API, we have a way to make sure that we never run expensive operations on deleted documents: we could first iterate over the approximation, then check that the document is not deleted, and finally confirm the match. Since approximations are cheap, applying deleted docs after them would not be an issue.

      I would like to explore removing the "Bits acceptDocs" parameter from TermsEnum.postings, Weight.scorer, SpanWeight.getSpans and Weight.BulkScorer, and add it to BulkScorer.score. This way, bulk scorers would be the only API which would need to know how to apply deleted docs, which I think would be more manageable since we only have 3 or 4 impls. And DefaultBulkScorer would be implemented the way described above: first advance the approximation, then check deleted docs, then confirm the match, then collect. Of course that's only in the case the scorer supports approximations, if it does not, it means it is cheap so we can directly iterate the scorer and check deleted docs on top.

      1. LUCENE-6553.patch
        470 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          +1, this sounds wonderful, to stop "special casing" the live docs filter, inefficiently.

          Show
          Michael McCandless added a comment - +1, this sounds wonderful, to stop "special casing" the live docs filter, inefficiently.
          Hide
          Robert Muir added a comment -

          I agree this is definitely worth exploring, it could simplify a lot.

          I suspect this is also a source of inefficiencies since in some cases we apply deleted documents several times: for instance conjunctions apply deleted docs to every sub scorer.

          It also has the side effect of making code complex: trying to avoid these inefficiencies in other cases, such as in advance(). So code is completely duplicated from next() there just to avoid the livedocs check on every document when scanning.

          Show
          Robert Muir added a comment - I agree this is definitely worth exploring, it could simplify a lot. I suspect this is also a source of inefficiencies since in some cases we apply deleted documents several times: for instance conjunctions apply deleted docs to every sub scorer. It also has the side effect of making code complex: trying to avoid these inefficiencies in other cases, such as in advance(). So code is completely duplicated from next() there just to avoid the livedocs check on every document when scanning.
          Hide
          Adrien Grand added a comment -

          I started playing with this idea, and the only issue in core so far is FilteredQuery.RANDOM_ACCESS_STRATEGY which wants to apply a Filter as acceptDocs, which would not be supported anymore with this change.

          Show
          Adrien Grand added a comment - I started playing with this idea, and the only issue in core so far is FilteredQuery.RANDOM_ACCESS_STRATEGY which wants to apply a Filter as acceptDocs, which would not be supported anymore with this change.
          Hide
          Adrien Grand added a comment -

          Here is a patch that removes the handling of acceptDocs from the postings, spans and scorer APIs, and moves it from the constructor to the score() method for BulkScorer.

          In general I think it simplifies the code a lot:

          • we have lots of postings formats and query impls that do not need to care about deleted docs at all anymore since they use the default bulk scorer
          • CheckIndex does not need to test that postings formats ignore deleted docs correctly

          One thing I am unsure about is whether LeafReader.postings should still apply deleted docs or not. At least for other call sites, there would be a compilation error since the acceptDocs parameter was removed, but this method did not have such a parameter and implicitely applied the reader's live docs. For now I documented explicitly that live docs were not applied, but I could also understand why someone would like to see live docs applied for this method. The reason why I decided to not apply live docs is that then if you use this method in a Query implementation, the Scorer would be illegal since it would apply live docs while it's not supposed to.

          Show
          Adrien Grand added a comment - Here is a patch that removes the handling of acceptDocs from the postings, spans and scorer APIs, and moves it from the constructor to the score() method for BulkScorer. In general I think it simplifies the code a lot: we have lots of postings formats and query impls that do not need to care about deleted docs at all anymore since they use the default bulk scorer CheckIndex does not need to test that postings formats ignore deleted docs correctly One thing I am unsure about is whether LeafReader.postings should still apply deleted docs or not. At least for other call sites, there would be a compilation error since the acceptDocs parameter was removed, but this method did not have such a parameter and implicitely applied the reader's live docs. For now I documented explicitly that live docs were not applied, but I could also understand why someone would like to see live docs applied for this method. The reason why I decided to not apply live docs is that then if you use this method in a Query implementation, the Scorer would be illegal since it would apply live docs while it's not supposed to.
          Hide
          David Smiley added a comment -

          +1 LGTM.

          Wow the patch is a doozy. At least mostly it's changing parameters at call sites. For the most part the patch seems to reduce the lines of code (a good thing), particularly at the low-level postings format parts, so I think that's good.

          One thing I am unsure about is whether LeafReader.postings should still apply deleted docs or not.

          Your chosen approach makes sense to me. If it were to do it, it takes away control of wether they should be filtered or not.

          Show
          David Smiley added a comment - +1 LGTM. Wow the patch is a doozy. At least mostly it's changing parameters at call sites. For the most part the patch seems to reduce the lines of code (a good thing), particularly at the low-level postings format parts, so I think that's good. One thing I am unsure about is whether LeafReader.postings should still apply deleted docs or not. Your chosen approach makes sense to me. If it were to do it, it takes away control of wether they should be filtered or not.
          Hide
          Adrien Grand added a comment -

          I ran luceneutil on wikimedium10M with deleted documents:

                              TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                            Fuzzy1       91.66      (7.4%)       89.66      (7.3%)   -2.2% ( -15% -   13%)
                      OrHighNotLow       62.10      (3.1%)       61.14      (4.8%)   -1.5% (  -9% -    6%)
                       LowSpanNear       27.30      (3.3%)       26.88      (4.6%)   -1.5% (  -9% -    6%)
                      OrHighNotMed       38.74      (2.8%)       38.31      (4.6%)   -1.1% (  -8% -    6%)
                  HighSloppyPhrase        3.23      (4.7%)        3.20      (4.6%)   -1.0% (  -9% -    8%)
                   MedSloppyPhrase       54.01      (2.6%)       53.55      (2.2%)   -0.9% (  -5% -    4%)
                   LowSloppyPhrase       41.78      (2.3%)       41.50      (2.5%)   -0.7% (  -5% -    4%)
                         LowPhrase       15.75      (1.2%)       15.68      (2.1%)   -0.4% (  -3% -    2%)
                         MedPhrase       14.62      (1.5%)       14.58      (2.4%)   -0.3% (  -4% -    3%)
                        HighPhrase       20.86      (3.1%)       20.86      (4.0%)   -0.0% (  -6% -    7%)
                           Respell       94.55      (4.8%)       94.58      (4.5%)    0.0% (  -8% -    9%)
                          Wildcard       60.39      (4.4%)       60.49      (3.9%)    0.2% (  -7% -    8%)
                     OrHighNotHigh       33.38      (1.7%)       33.52      (3.3%)    0.4% (  -4% -    5%)
                      HighSpanNear        8.55      (2.4%)        8.61      (3.0%)    0.8% (  -4% -    6%)
                      OrNotHighMed      211.67      (1.6%)      214.27      (2.3%)    1.2% (  -2% -    5%)
                     OrNotHighHigh       63.32      (1.6%)       64.12      (3.1%)    1.3% (  -3% -    6%)
                      OrNotHighLow     1031.92      (3.7%)     1045.97      (4.7%)    1.4% (  -6% -   10%)
                          HighTerm      141.43      (3.7%)      143.85      (4.9%)    1.7% (  -6% -   10%)
                       MedSpanNear       34.47      (2.0%)       35.07      (2.1%)    1.7% (  -2% -    5%)
                           MedTerm      208.01      (3.6%)      211.76      (4.8%)    1.8% (  -6% -   10%)
                        AndHighLow      819.26      (5.5%)      842.89      (5.5%)    2.9% (  -7% -   14%)
                        AndHighMed      203.53      (2.2%)      210.03      (2.1%)    3.2% (  -1% -    7%)
                            IntNRQ       14.08      (8.1%)       14.59     (10.1%)    3.6% ( -13% -   23%)
                           Prefix3       41.82      (7.6%)       43.52      (8.8%)    4.1% ( -11% -   22%)
                       AndHighHigh       47.54      (1.9%)       49.68      (2.2%)    4.5% (   0% -    8%)
                         OrHighMed       71.76      (5.4%)       76.11      (4.9%)    6.1% (  -4% -   17%)
                           LowTerm      654.52      (9.3%)      695.50     (10.3%)    6.3% ( -12% -   28%)
                         OrHighLow       67.44      (5.4%)       72.46      (5.0%)    7.4% (  -2% -   18%)
                        OrHighHigh       26.92      (5.8%)       28.95      (5.4%)    7.5% (  -3% -   19%)
                            Fuzzy2       81.71     (22.1%)       96.27     (18.1%)   17.8% ( -18% -   74%)
          

          For most queries performance is similar, but disjunctions look like they got a slight peformance boost with this patch.

          Show
          Adrien Grand added a comment - I ran luceneutil on wikimedium10M with deleted documents: TaskQPS baseline StdDev QPS patch StdDev Pct diff Fuzzy1 91.66 (7.4%) 89.66 (7.3%) -2.2% ( -15% - 13%) OrHighNotLow 62.10 (3.1%) 61.14 (4.8%) -1.5% ( -9% - 6%) LowSpanNear 27.30 (3.3%) 26.88 (4.6%) -1.5% ( -9% - 6%) OrHighNotMed 38.74 (2.8%) 38.31 (4.6%) -1.1% ( -8% - 6%) HighSloppyPhrase 3.23 (4.7%) 3.20 (4.6%) -1.0% ( -9% - 8%) MedSloppyPhrase 54.01 (2.6%) 53.55 (2.2%) -0.9% ( -5% - 4%) LowSloppyPhrase 41.78 (2.3%) 41.50 (2.5%) -0.7% ( -5% - 4%) LowPhrase 15.75 (1.2%) 15.68 (2.1%) -0.4% ( -3% - 2%) MedPhrase 14.62 (1.5%) 14.58 (2.4%) -0.3% ( -4% - 3%) HighPhrase 20.86 (3.1%) 20.86 (4.0%) -0.0% ( -6% - 7%) Respell 94.55 (4.8%) 94.58 (4.5%) 0.0% ( -8% - 9%) Wildcard 60.39 (4.4%) 60.49 (3.9%) 0.2% ( -7% - 8%) OrHighNotHigh 33.38 (1.7%) 33.52 (3.3%) 0.4% ( -4% - 5%) HighSpanNear 8.55 (2.4%) 8.61 (3.0%) 0.8% ( -4% - 6%) OrNotHighMed 211.67 (1.6%) 214.27 (2.3%) 1.2% ( -2% - 5%) OrNotHighHigh 63.32 (1.6%) 64.12 (3.1%) 1.3% ( -3% - 6%) OrNotHighLow 1031.92 (3.7%) 1045.97 (4.7%) 1.4% ( -6% - 10%) HighTerm 141.43 (3.7%) 143.85 (4.9%) 1.7% ( -6% - 10%) MedSpanNear 34.47 (2.0%) 35.07 (2.1%) 1.7% ( -2% - 5%) MedTerm 208.01 (3.6%) 211.76 (4.8%) 1.8% ( -6% - 10%) AndHighLow 819.26 (5.5%) 842.89 (5.5%) 2.9% ( -7% - 14%) AndHighMed 203.53 (2.2%) 210.03 (2.1%) 3.2% ( -1% - 7%) IntNRQ 14.08 (8.1%) 14.59 (10.1%) 3.6% ( -13% - 23%) Prefix3 41.82 (7.6%) 43.52 (8.8%) 4.1% ( -11% - 22%) AndHighHigh 47.54 (1.9%) 49.68 (2.2%) 4.5% ( 0% - 8%) OrHighMed 71.76 (5.4%) 76.11 (4.9%) 6.1% ( -4% - 17%) LowTerm 654.52 (9.3%) 695.50 (10.3%) 6.3% ( -12% - 28%) OrHighLow 67.44 (5.4%) 72.46 (5.0%) 7.4% ( -2% - 18%) OrHighHigh 26.92 (5.8%) 28.95 (5.4%) 7.5% ( -3% - 19%) Fuzzy2 81.71 (22.1%) 96.27 (18.1%) 17.8% ( -18% - 74%) For most queries performance is similar, but disjunctions look like they got a slight peformance boost with this patch.
          Hide
          Adrien Grand added a comment -

          Luceneutil on wikimedium10M again, but without deleted documents this time:

                            IntNRQ        9.57      (5.8%)        9.31      (6.6%)   -2.7% ( -14% -   10%)
                           Prefix3      253.58      (3.5%)      249.27      (3.4%)   -1.7% (  -8% -    5%)
                           LowTerm      695.13      (2.9%)      685.91      (2.9%)   -1.3% (  -6% -    4%)
                          Wildcard       51.13      (3.6%)       50.49      (4.3%)   -1.3% (  -8% -    6%)
                   LowSloppyPhrase       13.87      (5.3%)       13.71      (5.4%)   -1.1% ( -11% -   10%)
                         MedPhrase       99.70      (3.2%)       98.69      (4.3%)   -1.0% (  -8% -    6%)
                            Fuzzy1       86.60     (11.0%)       85.75     (11.0%)   -1.0% ( -20% -   23%)
                           Respell      103.93      (3.3%)      103.18      (3.5%)   -0.7% (  -7% -    6%)
                  HighSloppyPhrase        8.18      (5.6%)        8.13      (5.9%)   -0.7% ( -11% -   11%)
                         OrHighLow       55.24      (6.4%)       54.90      (6.9%)   -0.6% ( -13% -   13%)
                        HighPhrase        8.42      (5.9%)        8.37      (6.4%)   -0.6% ( -12% -   12%)
                         OrHighMed       19.64      (6.4%)       19.52      (7.2%)   -0.6% ( -13% -   13%)
                         LowPhrase       58.69      (2.2%)       58.34      (2.4%)   -0.6% (  -5% -    4%)
                   MedSloppyPhrase       43.44      (5.4%)       43.21      (5.3%)   -0.5% ( -10% -   10%)
                        OrHighHigh       39.31      (6.5%)       39.14      (6.9%)   -0.4% ( -12% -   13%)
                        AndHighLow      690.71      (5.1%)      688.77      (4.3%)   -0.3% (  -9% -    9%)
                      OrNotHighMed      153.25      (1.8%)      152.97      (1.9%)   -0.2% (  -3% -    3%)
                       AndHighHigh       65.10      (2.6%)       65.08      (3.2%)   -0.0% (  -5% -    5%)
                     OrNotHighHigh       46.47      (1.4%)       46.47      (1.9%)   -0.0% (  -3% -    3%)
                        AndHighMed      168.75      (2.3%)      168.79      (2.2%)    0.0% (  -4% -    4%)
                       MedSpanNear       61.15      (3.9%)       61.41      (3.5%)    0.4% (  -6% -    8%)
                      OrNotHighLow     1137.12      (4.0%)     1142.11      (3.5%)    0.4% (  -6% -    8%)
                     OrHighNotHigh       54.49      (1.7%)       54.74      (1.9%)    0.5% (  -3% -    4%)
                       LowSpanNear       14.95      (2.8%)       15.02      (2.9%)    0.5% (  -5% -    6%)
                      OrHighNotMed       41.44      (2.5%)       41.73      (2.6%)    0.7% (  -4% -    5%)
                           MedTerm      289.16      (3.5%)      292.24      (2.9%)    1.1% (  -5% -    7%)
                      OrHighNotLow       87.80      (3.3%)       88.86      (3.1%)    1.2% (  -5% -    7%)
                          HighTerm       81.86      (3.9%)       83.56      (3.5%)    2.1% (  -5% -    9%)
                      HighSpanNear       42.21      (3.5%)       43.33      (4.2%)    2.6% (  -4% -   10%)
                            Fuzzy2       58.86     (15.6%)       60.45      (9.4%)    2.7% ( -19% -   32%)
          

          All differences look like noise to me?

          Show
          Adrien Grand added a comment - Luceneutil on wikimedium10M again, but without deleted documents this time: IntNRQ 9.57 (5.8%) 9.31 (6.6%) -2.7% ( -14% - 10%) Prefix3 253.58 (3.5%) 249.27 (3.4%) -1.7% ( -8% - 5%) LowTerm 695.13 (2.9%) 685.91 (2.9%) -1.3% ( -6% - 4%) Wildcard 51.13 (3.6%) 50.49 (4.3%) -1.3% ( -8% - 6%) LowSloppyPhrase 13.87 (5.3%) 13.71 (5.4%) -1.1% ( -11% - 10%) MedPhrase 99.70 (3.2%) 98.69 (4.3%) -1.0% ( -8% - 6%) Fuzzy1 86.60 (11.0%) 85.75 (11.0%) -1.0% ( -20% - 23%) Respell 103.93 (3.3%) 103.18 (3.5%) -0.7% ( -7% - 6%) HighSloppyPhrase 8.18 (5.6%) 8.13 (5.9%) -0.7% ( -11% - 11%) OrHighLow 55.24 (6.4%) 54.90 (6.9%) -0.6% ( -13% - 13%) HighPhrase 8.42 (5.9%) 8.37 (6.4%) -0.6% ( -12% - 12%) OrHighMed 19.64 (6.4%) 19.52 (7.2%) -0.6% ( -13% - 13%) LowPhrase 58.69 (2.2%) 58.34 (2.4%) -0.6% ( -5% - 4%) MedSloppyPhrase 43.44 (5.4%) 43.21 (5.3%) -0.5% ( -10% - 10%) OrHighHigh 39.31 (6.5%) 39.14 (6.9%) -0.4% ( -12% - 13%) AndHighLow 690.71 (5.1%) 688.77 (4.3%) -0.3% ( -9% - 9%) OrNotHighMed 153.25 (1.8%) 152.97 (1.9%) -0.2% ( -3% - 3%) AndHighHigh 65.10 (2.6%) 65.08 (3.2%) -0.0% ( -5% - 5%) OrNotHighHigh 46.47 (1.4%) 46.47 (1.9%) -0.0% ( -3% - 3%) AndHighMed 168.75 (2.3%) 168.79 (2.2%) 0.0% ( -4% - 4%) MedSpanNear 61.15 (3.9%) 61.41 (3.5%) 0.4% ( -6% - 8%) OrNotHighLow 1137.12 (4.0%) 1142.11 (3.5%) 0.4% ( -6% - 8%) OrHighNotHigh 54.49 (1.7%) 54.74 (1.9%) 0.5% ( -3% - 4%) LowSpanNear 14.95 (2.8%) 15.02 (2.9%) 0.5% ( -5% - 6%) OrHighNotMed 41.44 (2.5%) 41.73 (2.6%) 0.7% ( -4% - 5%) MedTerm 289.16 (3.5%) 292.24 (2.9%) 1.1% ( -5% - 7%) OrHighNotLow 87.80 (3.3%) 88.86 (3.1%) 1.2% ( -5% - 7%) HighTerm 81.86 (3.9%) 83.56 (3.5%) 2.1% ( -5% - 9%) HighSpanNear 42.21 (3.5%) 43.33 (4.2%) 2.6% ( -4% - 10%) Fuzzy2 58.86 (15.6%) 60.45 (9.4%) 2.7% ( -19% - 32%) All differences look like noise to me?
          Hide
          Robert Muir added a comment -

          +1

          Show
          Robert Muir added a comment - +1
          Hide
          Adrien Grand added a comment -

          Since it only modifies internal APIs, we could actually have this changed backported to 5.x. However, it would conflict with how FilteredQuery sometimes applies filters through acceptDocs so we would have to make FilteredQuery always rewrite to a BooleanQuery. I think it is fair, but it means that FilteredQuery will be slower when provided with a filter which has good random-access but a bad iterator.

          Opinions?

          Show
          Adrien Grand added a comment - Since it only modifies internal APIs, we could actually have this changed backported to 5.x. However, it would conflict with how FilteredQuery sometimes applies filters through acceptDocs so we would have to make FilteredQuery always rewrite to a BooleanQuery. I think it is fair, but it means that FilteredQuery will be slower when provided with a filter which has good random-access but a bad iterator. Opinions?
          Hide
          Robert Muir added a comment -

          Maybe FilteredQuery can apply filters through acceptDocs by wrapping the Bits with a two-phase DISI. Then iteration will not be used just like today.

          Show
          Robert Muir added a comment - Maybe FilteredQuery can apply filters through acceptDocs by wrapping the Bits with a two-phase DISI. Then iteration will not be used just like today.
          Hide
          Adrien Grand added a comment -

          This is a good idea, I opened LUCENE-6601.

          Show
          Adrien Grand added a comment - This is a good idea, I opened LUCENE-6601 .
          Hide
          Adrien Grand added a comment -

          I just committed LUCENE-6601 so I'll commit this change to the 5.x branch as well.

          Show
          Adrien Grand added a comment - I just committed LUCENE-6601 so I'll commit this change to the 5.x branch as well.
          Hide
          ASF subversion and git services added a comment -

          Commit 1687524 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1687524 ]

          LUCENE-6553: Simplify how live docs are applied.

          Show
          ASF subversion and git services added a comment - Commit 1687524 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1687524 ] LUCENE-6553 : Simplify how live docs are applied.
          Hide
          ASF subversion and git services added a comment -

          Commit 1687580 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1687580 ]

          LUCENE-6553: Fix how DrillSidewaysScorer handles deleted docs.

          Show
          ASF subversion and git services added a comment - Commit 1687580 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1687580 ] LUCENE-6553 : Fix how DrillSidewaysScorer handles deleted docs.
          Hide
          ASF subversion and git services added a comment -

          Commit 1687581 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1687581 ]

          LUCENE-6553: Simplify how live docs are applied.

          Show
          ASF subversion and git services added a comment - Commit 1687581 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1687581 ] LUCENE-6553 : Simplify how live docs are applied.
          Hide
          ASF subversion and git services added a comment -

          Commit 1687586 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1687586 ]

          LUCENE-6553: Fixes to Weight's javadoc.

          Show
          ASF subversion and git services added a comment - Commit 1687586 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1687586 ] LUCENE-6553 : Fixes to Weight's javadoc.
          Hide
          ASF subversion and git services added a comment -

          Commit 1687712 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1687712 ]

          LUCENE-6553: Fix reuse of postings enums in DirectPostingsFormat.

          Show
          ASF subversion and git services added a comment - Commit 1687712 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1687712 ] LUCENE-6553 : Fix reuse of postings enums in DirectPostingsFormat.
          Hide
          ASF subversion and git services added a comment -

          Commit 1687722 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1687722 ]

          LUCENE-6553: Fix reuse of postings enums in DirectPostingsFormat.

          Show
          ASF subversion and git services added a comment - Commit 1687722 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1687722 ] LUCENE-6553 : Fix reuse of postings enums in DirectPostingsFormat.
          Hide
          Shalin Shekhar Mangar added a comment -

          Bulk close for 5.3.0 release

          Show
          Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development