Lucene - Core
  1. Lucene - Core
  2. LUCENE-6926

Take matchCost into account for MUST_NOT clauses

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      ReqExclScorer potentially has two TwoPhaseIterators to check: the one for the positive clause and the one for the negative clause. It should leverage the match cost API to check the least costly one first.

      1. LUCENE-6926.patch
        7 kB
        Adrien Grand
      2. LUCENE-6926.patch
        6 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here is a patch which changes several things:

        • ReqExclScorer now always exposes a two phase iterator, using the approximation of the positive clause as an approximation. This is useful because advancing to a non-excluded doc can be heavy if the excluded clause matches lots of documents. It also makes Lucene check deleted docs before we check the excluded clause.
        • In case ReqExclScorer needs to confirm both two phase iterators, it will confirm the least-costly one first.
        • The matchCost() now includes the match cost of the excluded clause and tries to take into account the cost of advance (by using an arbitrary constant, but this is probably better than nothing?)
        Show
        Adrien Grand added a comment - Here is a patch which changes several things: ReqExclScorer now always exposes a two phase iterator, using the approximation of the positive clause as an approximation. This is useful because advancing to a non-excluded doc can be heavy if the excluded clause matches lots of documents. It also makes Lucene check deleted docs before we check the excluded clause. In case ReqExclScorer needs to confirm both two phase iterators, it will confirm the least-costly one first. The matchCost() now includes the match cost of the excluded clause and tries to take into account the cost of advance (by using an arbitrary constant, but this is probably better than nothing?)
        Hide
        Paul Elschot added a comment -

        First impression is good, and core tests pass here.
        I don't understand everything immediately though, but don't wait for that.

        Show
        Paul Elschot added a comment - First impression is good, and core tests pass here. I don't understand everything immediately though, but don't wait for that.
        Hide
        Adrien Grand added a comment -

        If anything is not clear, feel free to ask! This might point to something that I overlooked or should document better.

        Show
        Adrien Grand added a comment - If anything is not clear, feel free to ask! This might point to something that I overlooked or should document better.
        Hide
        Paul Elschot added a comment -

        The patch uses only matchCost of req and excl, but it might be better to also use their costs as a weight, just like a conjunction. The patch uses equal weights for req and excl.

        Here the conjunction is (req and (not excl)), and for weighting their costs are req.cost() and (N - excl.cost()), with N as the number of docs in the segment.
        This would boil down to normally using req first, as expected. Only when excl matches a lot of docs, its weighting cost would be lower.

        Would it be worthwhile to use such weights here?
        See also LUCENE-6894 where the number of docs in a segment is used much more.

        Show
        Paul Elschot added a comment - The patch uses only matchCost of req and excl, but it might be better to also use their costs as a weight, just like a conjunction. The patch uses equal weights for req and excl. Here the conjunction is (req and (not excl)), and for weighting their costs are req.cost() and (N - excl.cost()), with N as the number of docs in the segment. This would boil down to normally using req first, as expected. Only when excl matches a lot of docs, its weighting cost would be lower. Would it be worthwhile to use such weights here? See also LUCENE-6894 where the number of docs in a segment is used much more.
        Hide
        Adrien Grand added a comment -

        OK, I gave this idea a try: it weights the cost of matching the prohibited clause by the maximum ratio of the required clause that the prohibited clause might also match. For instance if the required clause has a cost of 1000 and the prohibited clause has a cost of 10, then we know we will be advancing the prohibited iterator and confirming the two-phase iterator at most 1% of the time so the final cost can be (${match cost of the required clause} + 0.01 * (${cost of advancing the prohibited iterator} + ${match cost of the prohibited clause})). Is it what you had in mind?

        Show
        Adrien Grand added a comment - OK, I gave this idea a try: it weights the cost of matching the prohibited clause by the maximum ratio of the required clause that the prohibited clause might also match. For instance if the required clause has a cost of 1000 and the prohibited clause has a cost of 10, then we know we will be advancing the prohibited iterator and confirming the two-phase iterator at most 1% of the time so the final cost can be (${match cost of the required clause} + 0.01 * (${cost of advancing the prohibited iterator} + ${match cost of the prohibited clause})). Is it what you had in mind?
        Hide
        Paul Elschot added a comment -

        I was thinking about using the cost of an excluded clause as (N - cost), and then being consistent with ConjunctionScorer.

        But we could also just wrap each MUST_NOT clause so it becomes a normal non scoring boolean clause. The wrapper would only implement NOT.

        Then ConjunctionScorer can do the sorting and the wrapped MUST_NOT would normally come last, just like the normal case here.

        Since the current ReqExclScorer is also a wrapper, I would expect that a wrapper for each excluded clause has similar performance. This might also simplify BooleanWeight.

        Does that sound feasible?

        Show
        Paul Elschot added a comment - I was thinking about using the cost of an excluded clause as (N - cost), and then being consistent with ConjunctionScorer. But we could also just wrap each MUST_NOT clause so it becomes a normal non scoring boolean clause. The wrapper would only implement NOT. Then ConjunctionScorer can do the sorting and the wrapped MUST_NOT would normally come last, just like the normal case here. Since the current ReqExclScorer is also a wrapper, I would expect that a wrapper for each excluded clause has similar performance. This might also simplify BooleanWeight. Does that sound feasible?
        Hide
        Paul Elschot added a comment -

        I tried implementing this NOT wrapper, but it is not feasible because the nextDoc() implementation will have to do a linear scan as long as the wrapped iterator provides consecutive docs.
        So this might be nice in theory, but it will not perform well.

        That means that I can't easily improve on the latest patch, it looks good, and core tests pass here.

        Show
        Paul Elschot added a comment - I tried implementing this NOT wrapper, but it is not feasible because the nextDoc() implementation will have to do a linear scan as long as the wrapped iterator provides consecutive docs. So this might be nice in theory, but it will not perform well. That means that I can't easily improve on the latest patch, it looks good, and core tests pass here.
        Hide
        Adrien Grand added a comment -

        Thanks Paul, I see what you meant now and I agree with your conclusion that it is not practical. This reminds me of a related idea that I had some time ago that some scorers return bitset-based iterators, which could easily be NOT'ed by using nextClearBit instead of nextSetBit so that we could apply them as a required clause (which should perform better) instead of a prohibited clause.

        I'll commit the patch shortly.

        Show
        Adrien Grand added a comment - Thanks Paul, I see what you meant now and I agree with your conclusion that it is not practical. This reminds me of a related idea that I had some time ago that some scorers return bitset-based iterators, which could easily be NOT'ed by using nextClearBit instead of nextSetBit so that we could apply them as a required clause (which should perform better) instead of a prohibited clause. I'll commit the patch shortly.
        Hide
        ASF subversion and git services added a comment -

        Commit 1720544 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1720544 ]

        LUCENE-6926: Take the match cost into account for MUST_NOT clauses.

        Show
        ASF subversion and git services added a comment - Commit 1720544 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1720544 ] LUCENE-6926 : Take the match cost into account for MUST_NOT clauses.
        Hide
        ASF subversion and git services added a comment -

        Commit 1720548 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1720548 ]

        LUCENE-6926: Take the match cost into account for MUST_NOT clauses.

        Show
        ASF subversion and git services added a comment - Commit 1720548 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1720548 ] LUCENE-6926 : Take the match cost into account for MUST_NOT clauses.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development