Lucene - Core
  1. Lucene - Core
  2. LUCENE-730

Restore top level disjunction performance

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2
    • Component/s: core/search
    • Labels:
      None

      Description

      This patch restores the performance of top level disjunctions.
      The introduction of BooleanScorer2 had impacted this as reported
      on java-user on 21 Nov 2006 by Stanislav Jordanov.

      1. lucene-730.patch
        11 kB
        Michael Busch
      2. lucene-730.patch
        9 kB
        Michael Busch
      3. lucene-730.patch
        5 kB
        Michael Busch
      4. TopLevelDisjunction20061127.patch
        6 kB
        Paul Elschot

        Activity

        Hide
        Paul Elschot added a comment -

        This patches BooleanScorer2 to use BooleanScorer in the score(HitCollector) method.
        This also patches BooleanScorer to accept a minimum number of optional matchers.

        The patch also disables some test code: the use of checkSkipTo in QueryUtils
        caused a test failure in TestBoolean2 with the above changes. I think this
        could be expected because of the changed document scoring order
        for top level disjunction queries.
        At the moment I don't know how to resolve this.

        With the complete patch, all tests pass here.

        Show
        Paul Elschot added a comment - This patches BooleanScorer2 to use BooleanScorer in the score(HitCollector) method. This also patches BooleanScorer to accept a minimum number of optional matchers. The patch also disables some test code: the use of checkSkipTo in QueryUtils caused a test failure in TestBoolean2 with the above changes. I think this could be expected because of the changed document scoring order for top level disjunction queries. At the moment I don't know how to resolve this. With the complete patch, all tests pass here.
        Hide
        Otis Gospodnetic added a comment -

        Just a quick note that I contacted Stanislav Jordanov about Paul's patch here. Stanislav only used BooleanScorer.setUseScorer14() and that restored performance for him, but he did not try this patch (and won't be doing that as he's not working with Lucene at the moment).

        Show
        Otis Gospodnetic added a comment - Just a quick note that I contacted Stanislav Jordanov about Paul's patch here. Stanislav only used BooleanScorer.setUseScorer14() and that restored performance for him, but he did not try this patch (and won't be doing that as he's not working with Lucene at the moment).
        Hide
        Otis Gospodnetic added a comment -

        Paul, what is special about the number 32 here (BooleanScorer2):

        + if ((requiredScorers.size() == 0) &&
        + prohibitedScorers.size() < 32) {
        + // fall back to BooleanScorer, scores documents somewhat out of order
        + BooleanScorer bs = new BooleanScorer(getSimilarity(), minNrShouldMatch);

        Why can we use BooleanScorer if there are less than 32 prohibited clauses, but not otherwise? Thanks.

        Show
        Otis Gospodnetic added a comment - Paul, what is special about the number 32 here (BooleanScorer2): + if ((requiredScorers.size() == 0) && + prohibitedScorers.size() < 32) { + // fall back to BooleanScorer, scores documents somewhat out of order + BooleanScorer bs = new BooleanScorer(getSimilarity(), minNrShouldMatch); Why can we use BooleanScorer if there are less than 32 prohibited clauses, but not otherwise? Thanks.
        Hide
        Yonik Seeley added a comment -

        32 is the max number of required + prohibited clauses in the orig BooleanScorer (because it uses an int as a bitfield for each document in the current id range being considered).

        Show
        Yonik Seeley added a comment - 32 is the max number of required + prohibited clauses in the orig BooleanScorer (because it uses an int as a bitfield for each document in the current id range being considered).
        Hide
        Paul Elschot added a comment -

        Further to Yonik's answer, I have not done any tests with prohibited scorers comparing BooleanScorer and BooleanScorer2.

        It is quite possible that using skipTo() on any prohibited scorer (via BooleanScorer2) is generally faster than using BooleanScorer. Prohibited clauses in queries are quite seldom, so it is going to be difficult to find out whether a smaller value than 32 would be generally optimal.

        Show
        Paul Elschot added a comment - Further to Yonik's answer, I have not done any tests with prohibited scorers comparing BooleanScorer and BooleanScorer2. It is quite possible that using skipTo() on any prohibited scorer (via BooleanScorer2) is generally faster than using BooleanScorer. Prohibited clauses in queries are quite seldom, so it is going to be difficult to find out whether a smaller value than 32 would be generally optimal.
        Hide
        Otis Gospodnetic added a comment -

        I've committed this (changed a few minor things in the patch)...without benchmarking BS vs. BS2 with < 32 prohibited clauses.

        Hm, if I exposed that 32 as a static setter method, then one could easily benchmark and compare BS vs. BS2 with Doron's contrib/benchmark.

        Show
        Otis Gospodnetic added a comment - I've committed this (changed a few minor things in the patch)...without benchmarking BS vs. BS2 with < 32 prohibited clauses. Hm, if I exposed that 32 as a static setter method, then one could easily benchmark and compare BS vs. BS2 with Doron's contrib/benchmark.
        Hide
        Michael Busch added a comment -

        As discussed on java-dev the default behavior of BooleanScorer should be to return the documents in order, because there are people who rely in their apps on that. Docs out of order should only be allowed if BooleanQuery.setUseScorer14(true) is set explicitly.

        Show
        Michael Busch added a comment - As discussed on java-dev the default behavior of BooleanScorer should be to return the documents in order, because there are people who rely in their apps on that. Docs out of order should only be allowed if BooleanQuery.setUseScorer14(true) is set explicitly.
        Hide
        Michael Busch added a comment -

        With this patch the old BooleanScorer is only used if BooleanQuery.setUseScorer14(true) is set. It also enables the tests in QueryUtils again that check if the docs are returned in order.

        All tests pass.

        Show
        Michael Busch added a comment - With this patch the old BooleanScorer is only used if BooleanQuery.setUseScorer14(true) is set. It also enables the tests in QueryUtils again that check if the docs are returned in order. All tests pass.
        Hide
        Paul Elschot added a comment -

        The patch applies cleanly here, all core tests pass.
        And I like the allowDocsOutOfOrder approach.

        Show
        Paul Elschot added a comment - The patch applies cleanly here, all core tests pass. And I like the allowDocsOutOfOrder approach.
        Hide
        Michael Busch added a comment -

        Thanks for reviewing, Paul!

        I will commit this soon if nobody objects...

        Show
        Michael Busch added a comment - Thanks for reviewing, Paul! I will commit this soon if nobody objects...
        Hide
        Paul Elschot added a comment -

        No objection, only some remarks.

        One bigger issue:

        The latest patch defaults to docs in order above performance,
        but my personal taste is to have performance by default.

        And some smaller ones:

        One could still adapt QueryUtills to take the possibility
        of docs out of order into account.

        Some performance tests with prohibited scorers could still
        be needed to find out which of the boolean scorers does better
        on them.

        Show
        Paul Elschot added a comment - No objection, only some remarks. One bigger issue: The latest patch defaults to docs in order above performance, but my personal taste is to have performance by default. And some smaller ones: One could still adapt QueryUtills to take the possibility of docs out of order into account. Some performance tests with prohibited scorers could still be needed to find out which of the boolean scorers does better on them.
        Hide
        Doron Cohen added a comment -

        Two comments:

        With this patch the class BooleanWeight is not
        in (direct) use anymore - it is extended by
        BooleanWeight2 and then only the latter is used,
        and creates either Scorer2 or Scorer. We could
        get rid of BolleanWeight2, and have a single
        class BooleanWeight.

        Javadocs for useScorer14 methods:
        /**

        • Indicates that 1.4 BooleanScorer should be used.
        • Being static, This setting is system wide.
        • Scoring in 1.4 mode may be faster.
        • But note that unlike the default behavior, it does
        • not guarantee that docs are collected in docid
        • order. In other words, with this setting,
        • {@link HitCollector#collect(int,float)}

          might be

        • invoked first for docid N and only later for docid N-1.
          */
          public static void setUseScorer14(boolean use14) {

        /**

        • Whether 1.4 BooleanScorer should be used.
        • @see #setUseScorer14(boolean)
          */
          public static boolean getUseScorer14() {
        Show
        Doron Cohen added a comment - Two comments: With this patch the class BooleanWeight is not in (direct) use anymore - it is extended by BooleanWeight2 and then only the latter is used, and creates either Scorer2 or Scorer. We could get rid of BolleanWeight2, and have a single class BooleanWeight. Javadocs for useScorer14 methods: /** Indicates that 1.4 BooleanScorer should be used. Being static, This setting is system wide. Scoring in 1.4 mode may be faster. But note that unlike the default behavior, it does not guarantee that docs are collected in docid order. In other words, with this setting, {@link HitCollector#collect(int,float)} might be invoked first for docid N and only later for docid N-1. */ public static void setUseScorer14(boolean use14) { /** Whether 1.4 BooleanScorer should be used. @see #setUseScorer14(boolean) */ public static boolean getUseScorer14() {
        Hide
        Hoss Man added a comment -

        > The latest patch defaults to docs in order above performance,
        > but my personal taste is to have performance by default.

        I think it makes more sense to "default" to the most consistent rigidly defined behavior (docs in order), since that behavior will work (by definition) for any caller regardless of whether the caller expects the docs in order or not.

        people who find performance lacking can then assess their needs and make a conscious choice to change the setting, and see if it actually improves performance in their use cases.

        (ie: "avoid premature optimization" and all that)

        Show
        Hoss Man added a comment - > The latest patch defaults to docs in order above performance, > but my personal taste is to have performance by default. I think it makes more sense to "default" to the most consistent rigidly defined behavior (docs in order), since that behavior will work (by definition) for any caller regardless of whether the caller expects the docs in order or not. people who find performance lacking can then assess their needs and make a conscious choice to change the setting, and see if it actually improves performance in their use cases. (ie: "avoid premature optimization" and all that)
        Hide
        Michael Busch added a comment -

        > The latest patch defaults to docs in order above performance,
        > but my personal taste is to have performance by default.

        I agree with Hoss here. IMO allowing docs out of order is a big
        API change. I think if people switch to 2.2 they just want it
        to work as before without having to add special settings. If
        they need better performance for certain types of queries and
        they know that their application can deal with docs out of order
        they can enable the faster scoring.
        So my vote is +1 for docs in order by default.

        > Some performance tests with prohibited scorers could still
        > be needed to find out which of the boolean scorers does better
        > on them.

        That'd be helpful. However, I'm currently working on some other
        issues. Maybe you or others would have some time to run those
        tests?

        Show
        Michael Busch added a comment - > The latest patch defaults to docs in order above performance, > but my personal taste is to have performance by default. I agree with Hoss here. IMO allowing docs out of order is a big API change. I think if people switch to 2.2 they just want it to work as before without having to add special settings. If they need better performance for certain types of queries and they know that their application can deal with docs out of order they can enable the faster scoring. So my vote is +1 for docs in order by default. > Some performance tests with prohibited scorers could still > be needed to find out which of the boolean scorers does better > on them. That'd be helpful. However, I'm currently working on some other issues. Maybe you or others would have some time to run those tests?
        Hide
        Michael Busch added a comment -

        > With this patch the class BooleanWeight is not
        > in (direct) use anymore - it is extended by
        > BooleanWeight2 and then only the latter is used,
        > and creates either Scorer2 or Scorer. We could
        > get rid of BolleanWeight2, and have a single
        > class BooleanWeight.

        Agree. Will do.

        > Javadocs for useScorer14 methods:

        This is good! Thanks Doron, I will add the javadocs
        to my patch.

        Show
        Michael Busch added a comment - > With this patch the class BooleanWeight is not > in (direct) use anymore - it is extended by > BooleanWeight2 and then only the latter is used, > and creates either Scorer2 or Scorer. We could > get rid of BolleanWeight2, and have a single > class BooleanWeight. Agree. Will do. > Javadocs for useScorer14 methods: This is good! Thanks Doron, I will add the javadocs to my patch.
        Hide
        Michael Busch added a comment -

        New patch with the following changes:

        • Removes BooleanWeight2
        • Javadocs for useScorer14 methods provided by Doron
        Show
        Michael Busch added a comment - New patch with the following changes: Removes BooleanWeight2 Javadocs for useScorer14 methods provided by Doron
        Hide
        Paul Elschot added a comment -

        (Is the patch reversed? It did not apply at the first attempt,
        probably because my working copy is not the same as the trunk.)
        After ant clean, the boolean tests still pass here:
        ant -Dtestcase='TestBool*' test-core

        A slight improvement for the javadocs of BooleanQuery.java.
        In the javadocs of the scorer() method it is indicated that a BooleanScorer2
        will always be used, so it is better to mention here that BooleanScorer2
        delegates to a 1.4 scorer in some cases:

        /**

        • Indicates that BooleanScorer2 will delegate
        • the scoring to a 1.4 BooleanScorer
        • for most queries without required clauses.
        • Being static, this setting is system wide.
        • Scoring in 1.4 mode may be faster.
        • But note that unlike the default behavior, it does
        • not guarantee that docs are collected in docid
        • order. In other words, with this setting,
        • {@link HitCollector#collect(int,float)}

          might be

        • invoked first for docid N and only later for docid N-1.
          */
          public static void setUseScorer14(boolean use14) { useScorer14 = use14; }
        Show
        Paul Elschot added a comment - (Is the patch reversed? It did not apply at the first attempt, probably because my working copy is not the same as the trunk.) After ant clean, the boolean tests still pass here: ant -Dtestcase='TestBool*' test-core A slight improvement for the javadocs of BooleanQuery.java. In the javadocs of the scorer() method it is indicated that a BooleanScorer2 will always be used, so it is better to mention here that BooleanScorer2 delegates to a 1.4 scorer in some cases: /** Indicates that BooleanScorer2 will delegate the scoring to a 1.4 BooleanScorer for most queries without required clauses. Being static, this setting is system wide. Scoring in 1.4 mode may be faster. But note that unlike the default behavior, it does not guarantee that docs are collected in docid order. In other words, with this setting, {@link HitCollector#collect(int,float)} might be invoked first for docid N and only later for docid N-1. */ public static void setUseScorer14(boolean use14) { useScorer14 = use14; }
        Hide
        Michael Busch added a comment -

        > A slight improvement for the javadocs of BooleanQuery.java.
        > In the javadocs of the scorer() method it is indicated that a BooleanScorer2
        > will always be used, so it is better to mention here that BooleanScorer2
        > delegates to a 1.4 scorer in some cases:

        Maybe we should just deprecate the useScorer14 methods and add new methods
        allowDocsOutOfOrder. That should be easier to understand for the users.
        And probably most users don't know (or don't care about) the differences
        between BooleanScorer and BooleanScorer2 anyway.

        Show
        Michael Busch added a comment - > A slight improvement for the javadocs of BooleanQuery.java. > In the javadocs of the scorer() method it is indicated that a BooleanScorer2 > will always be used, so it is better to mention here that BooleanScorer2 > delegates to a 1.4 scorer in some cases: Maybe we should just deprecate the useScorer14 methods and add new methods allowDocsOutOfOrder. That should be easier to understand for the users. And probably most users don't know (or don't care about) the differences between BooleanScorer and BooleanScorer2 anyway.
        Hide
        Michael Busch added a comment -

        New patch that deprecates the useScorer14 methods and adds new
        methods:

        /**

        • Indicates whether hit docs may be collected out of docid
        • order. In other words, with this setting,
        • {@link HitCollector#collect(int,float)}

          might be

        • invoked first for docid N and only later for docid N-1.
        • Being static, this setting is system wide.
        • If docs out of order are allowed scoring might be faster
        • for certain queries (disjunction queries with less than
        • 32 prohibited terms). This setting has no effect for
        • other queries.
          */
          public static void setAllowDocsOutOfOrder(boolean allow);

        /**

        • Whether hit docs may be collected out of docid order.
        • @see #setAllowDocsOutOfOrder(boolean)
          */
          public static boolean getAllowDocsOutOfOrder();

        I think this is easier to understand for the users because it
        tells them what they need to know (docs in or out of order)
        and hides technical details (BooleanScorer vs. BooleanScorer2).

        All tests pass.

        Show
        Michael Busch added a comment - New patch that deprecates the useScorer14 methods and adds new methods: /** Indicates whether hit docs may be collected out of docid order. In other words, with this setting, {@link HitCollector#collect(int,float)} might be invoked first for docid N and only later for docid N-1. Being static, this setting is system wide. If docs out of order are allowed scoring might be faster for certain queries (disjunction queries with less than 32 prohibited terms). This setting has no effect for other queries. */ public static void setAllowDocsOutOfOrder(boolean allow); /** Whether hit docs may be collected out of docid order. @see #setAllowDocsOutOfOrder(boolean) */ public static boolean getAllowDocsOutOfOrder(); I think this is easier to understand for the users because it tells them what they need to know (docs in or out of order) and hides technical details (BooleanScorer vs. BooleanScorer2). All tests pass.
        Hide
        Michael Busch added a comment -

        I just committed the latest patch. Thanks everyone!

        Show
        Michael Busch added a comment - I just committed the latest patch. Thanks everyone!

          People

          • Assignee:
            Michael Busch
            Reporter:
            Paul Elschot
          • Votes:
            2 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development