Lucene - Core
  1. Lucene - Core
  2. LUCENE-6585

Make ConjunctionDISI flatten sub ConjunctionDISI instances

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today ConjunctionDISI wraps some sub (two-phase) iterators. I would like to improve it by flattening sub iterators when they implement ConjunctionDISI. In practice, this would make "+A +(+B +C)" be executed more like "+A +B +C" (only in terms of matching, scoring would not change).

      My motivation for this is that if we don't flatten and are unlucky, we can sometimes hit some worst cases. For instance consider the 3 following postings lists (sorted by increasing cost):

      A: 1, 1001, 2001, 3001, ...
      C: 0, 2, 4, 6, 8, 10, 12, 14, ...
      B: 1, 3, 5, 7, 9, 11, 13, 15, ...

      If we run "+A +B +C", then everything works fine, we use A as a lead, and advance B 1000 by 1000 to find the next match (if any).

      However if we run "+A +(+B +C)", then we would iterate B and C 2 by 2 over the entire doc ID space when trying to find the first match which occurs on or after A:1.

      This is an extreme example which is unlikely to happen in practice, but flattening would also help a bit on some more common cases. For instance imagine that A, B and C have respective costs of 100, 10 and 1000. If you search for "+A +(+B +C)", then we will use the most costly iterator (C) to confirm matches of B (the least costly iterator, used as a lead) while it would have been more efficient to confirm matches of B with A first, since A is less costly than C.

      1. LUCENE-6585.patch
        6 kB
        Ryan Ernst
      2. LUCENE-6585.patch
        5 kB
        Ryan Ernst

        Activity

        Hide
        Ryan Ernst added a comment -

        Here is a patch collapsing all subconjunctions (thanks Adrien Grand for the help with twophase).

        Show
        Ryan Ernst added a comment - Here is a patch collapsing all subconjunctions (thanks Adrien Grand for the help with twophase).
        Hide
        Adrien Grand added a comment -

        +1

        Show
        Adrien Grand added a comment - +1
        Hide
        Robert Muir added a comment -

        My primary concern is if the flattening optimization is safe.

        The test is great for testing this DISI itself, but some of our scorers are complicated and have things like coord() at play.

        So it would be great to see some simple nested boolean cases added to TestBooleanCoord, so we know the correct scores are still produced when DISIs are flattened. Maybe we should also add "conjunction-of-phrases" tests to *SearchEquivalence to know that phrase/sloppyphrase/spans/etc are ok with the change too.

        Show
        Robert Muir added a comment - My primary concern is if the flattening optimization is safe. The test is great for testing this DISI itself, but some of our scorers are complicated and have things like coord() at play. So it would be great to see some simple nested boolean cases added to TestBooleanCoord, so we know the correct scores are still produced when DISIs are flattened. Maybe we should also add "conjunction-of-phrases" tests to *SearchEquivalence to know that phrase/sloppyphrase/spans/etc are ok with the change too.
        Hide
        Paul Elschot added a comment -

        I checked that recursion is indeed not needed, ConjunctionDISI is always built bottom up.

        Show
        Paul Elschot added a comment - I checked that recursion is indeed not needed, ConjunctionDISI is always built bottom up.
        Hide
        Adrien Grand added a comment -

        Paul, your comment confused me: where do you see recursion in the patch?

        Show
        Adrien Grand added a comment - Paul, your comment confused me: where do you see recursion in the patch?
        Hide
        Paul Elschot added a comment -

        Sorry about that. The patch has no recursion, and does not need it.

        Show
        Paul Elschot added a comment - Sorry about that. The patch has no recursion, and does not need it.
        Hide
        ASF subversion and git services added a comment -

        Commit 1687400 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1687400 ]

        LUCENE-6585: improve TestBooleanCoord to better test coord handling

        Show
        ASF subversion and git services added a comment - Commit 1687400 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1687400 ] LUCENE-6585 : improve TestBooleanCoord to better test coord handling
        Hide
        ASF subversion and git services added a comment -

        Commit 1687402 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1687402 ]

        LUCENE-6585: improve TestBooleanCoord to better test coord handling

        Show
        ASF subversion and git services added a comment - Commit 1687402 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1687402 ] LUCENE-6585 : improve TestBooleanCoord to better test coord handling
        Hide
        ASF subversion and git services added a comment -

        Commit 1687404 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1687404 ]

        LUCENE-6585: add some tests that coord is applied properly for nested conjunctions

        Show
        ASF subversion and git services added a comment - Commit 1687404 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1687404 ] LUCENE-6585 : add some tests that coord is applied properly for nested conjunctions
        Hide
        ASF subversion and git services added a comment -

        Commit 1687406 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1687406 ]

        LUCENE-6585: add some tests that coord is applied properly for nested conjunctions

        Show
        ASF subversion and git services added a comment - Commit 1687406 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1687406 ] LUCENE-6585 : add some tests that coord is applied properly for nested conjunctions
        Hide
        Robert Muir added a comment -

        I added lots of coord() tests, and changed it to test a crazy coord(), the patch here is fine actually. It is difficult to see that its correct, but the tests help.
        I still opened LUCENE-6603 regardless, this stuff is really tricky and we should try to simplify life at some point.

        Separately my concerns around 2-phase iterators (phrase/multiphrase) should not be an issue, unless they already have bugs with the two-phase API. I will look into this now.

        But overall I think the patch here is good. We should probably add a comment to collapseSubs so that its a little more obvious:
        We collapse the sub-iterators (removing the useless conjunction "in between") but the two-phase views are also kept. So everything works because of guarantees in the api.

        I don't like the instanceof ConjunctionDISI check, with the assert false. The reason is that this class is not final. I would rather it be an explicit class check on getClass() == ConjunctionDISI.class or TwoPhase.class.

        Show
        Robert Muir added a comment - I added lots of coord() tests, and changed it to test a crazy coord(), the patch here is fine actually. It is difficult to see that its correct, but the tests help. I still opened LUCENE-6603 regardless, this stuff is really tricky and we should try to simplify life at some point. Separately my concerns around 2-phase iterators (phrase/multiphrase) should not be an issue, unless they already have bugs with the two-phase API. I will look into this now. But overall I think the patch here is good. We should probably add a comment to collapseSubs so that its a little more obvious: We collapse the sub-iterators (removing the useless conjunction "in between") but the two-phase views are also kept. So everything works because of guarantees in the api. I don't like the instanceof ConjunctionDISI check, with the assert false. The reason is that this class is not final. I would rather it be an explicit class check on getClass() == ConjunctionDISI.class or TwoPhase.class.
        Hide
        Ryan Ernst added a comment -

        Here's a new patch with instanceof replaced with ==. I also tried improving naming and comments a little.

        Show
        Ryan Ernst added a comment - Here's a new patch with instanceof replaced with == . I also tried improving naming and comments a little.
        Hide
        Robert Muir added a comment -

        Thanks, I am +1 to commit.

        Any "problems" are going to be problems in other scorers not obeying API contracts and not the logic here, so we don't need to hold up the change on improving testing for those IMO, but I still want to improve those tests eventually, it can be a followup. Its just good in general to assert more things about scoring so its less fragile.

        Show
        Robert Muir added a comment - Thanks, I am +1 to commit. Any "problems" are going to be problems in other scorers not obeying API contracts and not the logic here, so we don't need to hold up the change on improving testing for those IMO, but I still want to improve those tests eventually, it can be a followup. Its just good in general to assert more things about scoring so its less fragile.
        Hide
        ASF subversion and git services added a comment -

        Commit 1687408 from Ryan Ernst in branch 'dev/trunk'
        [ https://svn.apache.org/r1687408 ]

        LUCENE-6585: Flatten conjunctions and conjunction approximations into parent conjunctions

        Show
        ASF subversion and git services added a comment - Commit 1687408 from Ryan Ernst in branch 'dev/trunk' [ https://svn.apache.org/r1687408 ] LUCENE-6585 : Flatten conjunctions and conjunction approximations into parent conjunctions
        Hide
        ASF subversion and git services added a comment -

        Commit 1687412 from Ryan Ernst in branch 'dev/trunk'
        [ https://svn.apache.org/r1687412 ]

        LUCENE-6585: Remove println from testing

        Show
        ASF subversion and git services added a comment - Commit 1687412 from Ryan Ernst in branch 'dev/trunk' [ https://svn.apache.org/r1687412 ] LUCENE-6585 : Remove println from testing
        Hide
        ASF subversion and git services added a comment - - edited

        Commit 1687413 from Ryan Ernst in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1687413 ]

        LUCENE-6585: Flatten conjunctions and conjunction approximations into parent conjunctions (merged from r1687408, r1687412)

        Show
        ASF subversion and git services added a comment - - edited Commit 1687413 from Ryan Ernst in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1687413 ] LUCENE-6585 : Flatten conjunctions and conjunction approximations into parent conjunctions (merged from r1687408, r1687412)
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

          People

          • Assignee:
            Unassigned
            Reporter:
            Adrien Grand
          • Votes:
            2 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development