Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Two-phase support currently works for both sloppy and exact Scorers but it does not work if you have multiple terms at the same position (MultiPhraseQuery).

      This is because UnionPostingsEnum.nextDoc() aggressively reads and merges all the positions. Even making this initialization lazy might just be enough?

      1. LUCENE-6421_luceneutil.patch
        3 kB
        Robert Muir
      2. LUCENE-6421.patch
        12 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        See attached patch and benchmarks modifications / tasks file.

        • no longer keeps subs "one document ahead", its like a normal disjunction
        • positions reading/merging are deferred until freq() is called.
        • general cleanups

        The problems with the current code is more than just two-phase iteration, because it always reads all positions from all subs on nextDoc()/advance(), it slows down even the simplest multiphrase queries like these added to the tasks file:

        MultiPhraseHHH: multiPhrase//(body:in|of the)
        MultiPhraseHHM: multiPhrase//(body:in|of your)
        MultiPhraseHHL: multiPhrase//(body:in|of harvard)
        MultiPhraseMMH: multiPhrase//(body:northern|southern states)
        MultiPhraseMMM: multiPhrase//(body:northern|southern usa)
        MultiPhraseMML: multiPhrase//(body:northern|southern iraq)
        

        So in the example of northern|southern states, today all positions are read from either or both 'northern' and 'southern', regardless of whether 'states' is present in the doc at all. Filters will only aggravate the situation even more.

        Benchmarking these is super-slow, but after a few iterations it looks like this:

                            Task   QPS trunk      StdDev   QPS patch      StdDev                Pct diff
                  MultiPhraseHHH        0.34      (2.1%)        0.33      (1.4%)   -2.1% (  -5% -    1%)
                  MultiPhraseHHL       17.26      (0.7%)       17.67      (0.5%)    2.3% (   1% -    3%)
                  MultiPhraseHHM        5.13      (1.6%)        5.34      (0.3%)    4.1% (   2% -    6%)
                  MultiPhraseMMH       33.99      (1.3%)       39.19      (0.7%)   15.3% (  13% -   17%)
                  MultiPhraseMML      160.11      (0.2%)      202.29      (0.6%)   26.3% (  25% -   27%)
                  MultiPhraseMMM       72.20      (1.7%)       95.66      (2.0%)   32.5% (  28% -   36%)
        
        Show
        Robert Muir added a comment - See attached patch and benchmarks modifications / tasks file. no longer keeps subs "one document ahead", its like a normal disjunction positions reading/merging are deferred until freq() is called. general cleanups The problems with the current code is more than just two-phase iteration, because it always reads all positions from all subs on nextDoc()/advance(), it slows down even the simplest multiphrase queries like these added to the tasks file: MultiPhraseHHH: multiPhrase//(body:in|of the) MultiPhraseHHM: multiPhrase//(body:in|of your) MultiPhraseHHL: multiPhrase//(body:in|of harvard) MultiPhraseMMH: multiPhrase//(body:northern|southern states) MultiPhraseMMM: multiPhrase//(body:northern|southern usa) MultiPhraseMML: multiPhrase//(body:northern|southern iraq) So in the example of northern|southern states, today all positions are read from either or both 'northern' and 'southern', regardless of whether 'states' is present in the doc at all. Filters will only aggravate the situation even more. Benchmarking these is super-slow, but after a few iterations it looks like this: Task QPS trunk StdDev QPS patch StdDev Pct diff MultiPhraseHHH 0.34 (2.1%) 0.33 (1.4%) -2.1% ( -5% - 1%) MultiPhraseHHL 17.26 (0.7%) 17.67 (0.5%) 2.3% ( 1% - 3%) MultiPhraseHHM 5.13 (1.6%) 5.34 (0.3%) 4.1% ( 2% - 6%) MultiPhraseMMH 33.99 (1.3%) 39.19 (0.7%) 15.3% ( 13% - 17%) MultiPhraseMML 160.11 (0.2%) 202.29 (0.6%) 26.3% ( 25% - 27%) MultiPhraseMMM 72.20 (1.7%) 95.66 (2.0%) 32.5% ( 28% - 36%)
        Hide
        Robert Muir added a comment -

        final result:

        Report after iter 19:
        Chart saved to out.png... (wd: /home/rmuir/workspace/util/src/python)
                            Task   QPS trunk      StdDev   QPS patch      StdDev                Pct diff
                  MultiPhraseHHH        0.35      (6.3%)        0.35      (5.7%)    2.2% (  -9% -   15%)
                  MultiPhraseHHL       17.21      (1.8%)       17.65      (1.3%)    2.6% (   0% -    5%)
                  MultiPhraseHHM        5.12      (2.8%)        5.42      (1.9%)    5.7% (   1% -   10%)
                  MultiPhraseMMH       33.71      (2.9%)       39.70      (1.2%)   17.8% (  13% -   22%)
                  MultiPhraseMML      160.35      (1.5%)      201.54      (1.4%)   25.7% (  22% -   29%)
                  MultiPhraseMMM       72.26      (2.2%)       95.67      (2.0%)   32.4% (  27% -   37%)
        
        Show
        Robert Muir added a comment - final result: Report after iter 19: Chart saved to out.png... (wd: /home/rmuir/workspace/util/src/python) Task QPS trunk StdDev QPS patch StdDev Pct diff MultiPhraseHHH 0.35 (6.3%) 0.35 (5.7%) 2.2% ( -9% - 15%) MultiPhraseHHL 17.21 (1.8%) 17.65 (1.3%) 2.6% ( 0% - 5%) MultiPhraseHHM 5.12 (2.8%) 5.42 (1.9%) 5.7% ( 1% - 10%) MultiPhraseMMH 33.71 (2.9%) 39.70 (1.2%) 17.8% ( 13% - 22%) MultiPhraseMML 160.35 (1.5%) 201.54 (1.4%) 25.7% ( 22% - 29%) MultiPhraseMMM 72.26 (2.2%) 95.67 (2.0%) 32.4% ( 27% - 37%)
        Hide
        Adrien Grand added a comment -

        +1 this looks much cleaner already

        Show
        Adrien Grand added a comment - +1 this looks much cleaner already
        Hide
        ASF subversion and git services added a comment -

        Commit 1673572 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1673572 ]

        LUCENE-6421: defer reading of positions in MultiPhraseQuery until they are needed

        Show
        ASF subversion and git services added a comment - Commit 1673572 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1673572 ] LUCENE-6421 : defer reading of positions in MultiPhraseQuery until they are needed
        Hide
        ASF subversion and git services added a comment -

        Commit 1673575 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1673575 ]

        LUCENE-6421: defer reading of positions in MultiPhraseQuery until they are needed

        Show
        ASF subversion and git services added a comment - Commit 1673575 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673575 ] LUCENE-6421 : defer reading of positions in MultiPhraseQuery until they are needed
        Hide
        ASF subversion and git services added a comment -

        Commit 1673576 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1673576 ]

        LUCENE-6421: revert

        Show
        ASF subversion and git services added a comment - Commit 1673576 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673576 ] LUCENE-6421 : revert
        Hide
        ASF subversion and git services added a comment -

        Commit 1673577 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1673577 ]

        LUCENE-6421: revert

        Show
        ASF subversion and git services added a comment - Commit 1673577 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1673577 ] LUCENE-6421 : revert
        Hide
        Robert Muir added a comment -

        The bug was silly. MultiPhraseWeight would 'return null' if one term didn't exist in the segment, but that code is dead (we just ignore it and check if the list of postings is empty at the end).

        I also added some unit tests for the postings enum (which didnt have bugs, but still good to have direct tests).

        Show
        Robert Muir added a comment - The bug was silly. MultiPhraseWeight would 'return null' if one term didn't exist in the segment, but that code is dead (we just ignore it and check if the list of postings is empty at the end). I also added some unit tests for the postings enum (which didnt have bugs, but still good to have direct tests).
        Hide
        ASF subversion and git services added a comment -

        Commit 1673595 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1673595 ]

        LUCENE-6421: defer reading of positions in MultiPhraseQuery until they are needed

        Show
        ASF subversion and git services added a comment - Commit 1673595 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1673595 ] LUCENE-6421 : defer reading of positions in MultiPhraseQuery until they are needed
        Hide
        ASF subversion and git services added a comment -

        Commit 1673597 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1673597 ]

        LUCENE-6421: defer reading of positions in MultiPhraseQuery until they are needed

        Show
        ASF subversion and git services added a comment - Commit 1673597 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1673597 ] LUCENE-6421 : defer reading of positions in MultiPhraseQuery until they are needed
        Hide
        Anshum Gupta added a comment -

        Bulk close for 5.2.0.

        Show
        Anshum Gupta added a comment - Bulk close for 5.2.0.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development