Lucene - Core
  1. Lucene - Core
  2. LUCENE-6305

BooleanQuery.equals should ignore clause order

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      BooleanQuery.equals is sensitive to the order in which clauses have been added. So for instance "+A +B" would be considered different from "+B +A" although it generates the same matches and scores.

      1. LUCENE-6305.patch
        17 kB
        Adrien Grand
      2. LUCENE-6305.patch
        12 kB
        Adrien Grand
      3. LUCENE-6305.patch
        12 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        I think this issue is important if we want to be able to cache boolean queries efficiently.

        Here is a patch, it adds a simple HashMap-based Multiset (bag) implementation to oal.util and uses it in BooleanQuery so that the order in which clauses are added does not matter.

        Show
        Adrien Grand added a comment - I think this issue is important if we want to be able to cache boolean queries efficiently. Here is a patch, it adds a simple HashMap-based Multiset (bag) implementation to oal.util and uses it in BooleanQuery so that the order in which clauses are added does not matter.
        Hide
        Terry Smith added a comment -

        Having BooleanQuery.equals() ignore order is a great idea but I think it'd be better if we could preserve the original clause order for Query.toString(), the Explanation, debugging and test expectations.

        Additionally, I've been burnt by JVM changes to String.hashCode() that cause HashMap<String,?> to order entries differently when run in a newer JVM. Are the Query hash codes immune to this problem?

        Show
        Terry Smith added a comment - Having BooleanQuery.equals() ignore order is a great idea but I think it'd be better if we could preserve the original clause order for Query.toString(), the Explanation, debugging and test expectations. Additionally, I've been burnt by JVM changes to String.hashCode() that cause HashMap<String,?> to order entries differently when run in a newer JVM. Are the Query hash codes immune to this problem?
        Hide
        Adrien Grand added a comment -

        This is exactly what the patch is doing: clauses are put both into a list and a multiset, and the list is used for clauses()/toString() while the multiset is used for equals()/hashcode(). So the iteration order is predictable.

        Show
        Adrien Grand added a comment - This is exactly what the patch is doing: clauses are put both into a list and a multiset, and the list is used for clauses()/toString() while the multiset is used for equals()/hashcode(). So the iteration order is predictable.
        Hide
        Terry Smith added a comment -

        Oops, read the patch too quickly and missed that key detail! Sorry for the noise.

        Show
        Terry Smith added a comment - Oops, read the patch too quickly and missed that key detail! Sorry for the noise.
        Hide
        Terry Smith added a comment -

        Slightly off topic to your original goal, but what do you think about deduping repeated non scoring (FILTER, MUST_NOT) clauses from the list in the query or do you see that as an possible optimization when building the weights/scorers?

        Show
        Terry Smith added a comment - Slightly off topic to your original goal, but what do you think about deduping repeated non scoring (FILTER, MUST_NOT) clauses from the list in the query or do you see that as an possible optimization when building the weights/scorers?
        Hide
        Adrien Grand added a comment -

        Slightly off topic to your original goal, but what do you think about deduping repeated non scoring (FILTER, MUST_NOT) clauses from the list in the query or do you see that as an possible optimization when building the weights/scorers?

        +1 This would be a nice optimization for the rewrite method.

        Show
        Adrien Grand added a comment - Slightly off topic to your original goal, but what do you think about deduping repeated non scoring (FILTER, MUST_NOT) clauses from the list in the query or do you see that as an possible optimization when building the weights/scorers? +1 This would be a nice optimization for the rewrite method.
        Hide
        Adrien Grand added a comment -

        Revisiting this issue now that queries can be considered immutable. I rebased the patch to current trunk. BooleanQuery.equals/hashcode don't depend on the order of clauses anymore but the iteration order of the clauses and the toString() representation of BooleanQuery have not changed.

        Show
        Adrien Grand added a comment - Revisiting this issue now that queries can be considered immutable. I rebased the patch to current trunk. BooleanQuery.equals/hashcode don't depend on the order of clauses anymore but the iteration order of the clauses and the toString() representation of BooleanQuery have not changed.
        Hide
        Yonik Seeley added a comment -

        Cloud this optimization somehow be made optional?
        Anyone who is programmatically creating or mutating queries will naturally already order clauses.
        Also, doing something like adding another clause to an existing boolean query will do all the work of ordering yet again.
        Perhaps create a subclass for this stuff?

        Show
        Yonik Seeley added a comment - Cloud this optimization somehow be made optional? Anyone who is programmatically creating or mutating queries will naturally already order clauses. Also, doing something like adding another clause to an existing boolean query will do all the work of ordering yet again. Perhaps create a subclass for this stuff?
        Hide
        Adrien Grand added a comment -

        Actually I think of this change more as a bug fix than as an optimization: the order of the clauses has no meaning for BooleanQuery so it is silly that it is taken into account for equals/hashcode? I don't see why someone would like clause order to be meaningful?

        Show
        Adrien Grand added a comment - Actually I think of this change more as a bug fix than as an optimization: the order of the clauses has no meaning for BooleanQuery so it is silly that it is taken into account for equals/hashcode? I don't see why someone would like clause order to be meaningful?
        Hide
        Yonik Seeley added a comment -

        Actually I think of this change more as a bug fix than as an optimization

        If you have a system that always generates queries the same way (and there are quite a few of those who use lucene/solr), then the extra work is unwanted overhead. And in some cases, the overhead could be significant (very very large queries... i've seen examples).

        One could also check for more complex transformations to make apparently unequal boolean queries equal, and that would be great, but the cost should not be imposed on those who don't need/want it.

        Then think about taking one of those large queries and programmatically adding a single clause... all that work needs to be re-done?

        The same as checking for duplicate negative clauses - great if you have a system that could generate them I guess, but one should be able to opt-out.
        There are so many users, we need to continue to design for the corner cases as well as the mainstream, as long as those needs don't conflict. All that means in this case is to have a way to opt out of all the checking and say "just create me a weight with these clauses". I guess that could just be an expert static method on BooleanQuery?

        /** @lucene.expert */
        public static Weight createWeightFrom(List<BooleanClause> clauses)
        
        Show
        Yonik Seeley added a comment - Actually I think of this change more as a bug fix than as an optimization If you have a system that always generates queries the same way (and there are quite a few of those who use lucene/solr), then the extra work is unwanted overhead. And in some cases, the overhead could be significant (very very large queries... i've seen examples). One could also check for more complex transformations to make apparently unequal boolean queries equal, and that would be great, but the cost should not be imposed on those who don't need/want it. Then think about taking one of those large queries and programmatically adding a single clause... all that work needs to be re-done? The same as checking for duplicate negative clauses - great if you have a system that could generate them I guess, but one should be able to opt-out. There are so many users, we need to continue to design for the corner cases as well as the mainstream, as long as those needs don't conflict. All that means in this case is to have a way to opt out of all the checking and say "just create me a weight with these clauses". I guess that could just be an expert static method on BooleanQuery? /** @lucene.expert */ public static Weight createWeightFrom(List<BooleanClause> clauses)
        Hide
        Robert Muir added a comment -

        what a load of horseshit. Those expert users can use custom queries.

        Show
        Robert Muir added a comment - what a load of horseshit. Those expert users can use custom queries.
        Hide
        Yonik Seeley added a comment -

        what a load of horseshit

        I am continually reminded why I try to minimize any interaction in lucene these days... it's become the most hostile open source community I've ever seen.

        Those expert users can use custom queries.

        Which is what something like this would allow, without necessitating duplicating all of the boolean scorer logic:

        public static Weight createWeightFrom(List<BooleanClause> clauses)
        
        Show
        Yonik Seeley added a comment - what a load of horseshit I am continually reminded why I try to minimize any interaction in lucene these days... it's become the most hostile open source community I've ever seen. Those expert users can use custom queries. Which is what something like this would allow, without necessitating duplicating all of the boolean scorer logic: public static Weight createWeightFrom(List<BooleanClause> clauses)
        Hide
        Uwe Schindler added a comment -

        Then think about taking one of those large queries and programmatically adding a single clause... all that work needs to be re-done?

        In Lucene 6, BooleanQuery is unmodifiable, so you have to do the work anyways (or preserve the builder instance somewhere),

        Show
        Uwe Schindler added a comment - Then think about taking one of those large queries and programmatically adding a single clause... all that work needs to be re-done? In Lucene 6, BooleanQuery is unmodifiable, so you have to do the work anyways (or preserve the builder instance somewhere),
        Hide
        Adrien Grand added a comment -

        In general I don't like adding such methods because they increase the surface of our APIs and therefore its complexity. One trade-off I would be ok with would be to make the multiset lazily computed so that it does not get computed unless you call equals/hashcode. Would it work for you? If this is not enough, I tend to agree with Robert that the best option in that case might be to build a custom query.

        Show
        Adrien Grand added a comment - In general I don't like adding such methods because they increase the surface of our APIs and therefore its complexity. One trade-off I would be ok with would be to make the multiset lazily computed so that it does not get computed unless you call equals/hashcode. Would it work for you? If this is not enough, I tend to agree with Robert that the best option in that case might be to build a custom query.
        Hide
        Yonik Seeley added a comment -

        In Lucene 6, BooleanQuery is unmodifiable, so you have to do the work anyways (or preserve the builder instance somewhere),

        Correct... it's more about how expensive it is to build a boolean query.

        Show
        Yonik Seeley added a comment - In Lucene 6, BooleanQuery is unmodifiable, so you have to do the work anyways (or preserve the builder instance somewhere), Correct... it's more about how expensive it is to build a boolean query.
        Hide
        Robert Muir added a comment -

        I am continually reminded why I try to minimize any interaction in lucene these days... it's become the most hostile open source community I've ever seen.

        That would be great, then your passive-aggressive commentary (with no actual contributions) trying to hold back lucene would disappear.

        Your passive-aggressive crap is even more hostile. I am straightforward and call you out on it.

        Show
        Robert Muir added a comment - I am continually reminded why I try to minimize any interaction in lucene these days... it's become the most hostile open source community I've ever seen. That would be great, then your passive-aggressive commentary (with no actual contributions) trying to hold back lucene would disappear. Your passive-aggressive crap is even more hostile. I am straightforward and call you out on it.
        Hide
        Yonik Seeley added a comment -

        trying to hold back lucene

        That's the problem (among other things)... you see ulterior motives everywhere. Just like you accused Hoss of "You just want to make Lucene harder to use, so more people will use apache solr instead."
        https://issues.apache.org/jira/browse/LUCENE-5859?focusedCommentId=14080155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14080155

        Pretty sad.

        Show
        Yonik Seeley added a comment - trying to hold back lucene That's the problem (among other things)... you see ulterior motives everywhere. Just like you accused Hoss of "You just want to make Lucene harder to use, so more people will use apache solr instead." https://issues.apache.org/jira/browse/LUCENE-5859?focusedCommentId=14080155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14080155 Pretty sad.
        Hide
        Hoss Man added a comment -

        If you have a system that always generates queries the same way (and there are quite a few of those who use lucene/solr), then the extra work is unwanted overhead. And in some cases, the overhead could be significant (very very large queries... i've seen examples).

        +1

        One trade-off I would be ok with would be to make the multiset lazily computed so that it does not get computed unless you call equals/hashcode. ... the best option in that case might be to build a custom query.

        Why (in the latest patch) is the multiset (and multiset comparison in .equals) part of BooleanQuery at all?

        I wasn't following the interconnection between all of these issues very closely, but i thought Adrien mentioned at one point that one of the motivations/ benefits of the FooQuery.Builder APIs (and making the queries immutable) was that it would allow us to simplify (and make more efficient) some of the methods like .equals and .hashCode by moving the expensive logic (ie: deterministically ordering the clauses) into the Builder.build() methods so that the queries themselves wouldn't have to carry around the extra data structures or do any particularly complex logic in .equals()

        wouldn't that be the best of both worlds? only do the expensive comparisons / deterministic ordering of the BooleanClauses once in BooleanQuery.build(), and for "expert" cases where you are programatically creating adding clauses in a known deterministic order anyway, you can write you're own custom BooleanQuery.Builder (sub)class?

        Show
        Hoss Man added a comment - If you have a system that always generates queries the same way (and there are quite a few of those who use lucene/solr), then the extra work is unwanted overhead. And in some cases, the overhead could be significant (very very large queries... i've seen examples). +1 One trade-off I would be ok with would be to make the multiset lazily computed so that it does not get computed unless you call equals/hashcode. ... the best option in that case might be to build a custom query. Why (in the latest patch) is the multiset (and multiset comparison in .equals) part of BooleanQuery at all? I wasn't following the interconnection between all of these issues very closely, but i thought Adrien mentioned at one point that one of the motivations/ benefits of the FooQuery.Builder APIs (and making the queries immutable) was that it would allow us to simplify (and make more efficient) some of the methods like .equals and .hashCode by moving the expensive logic (ie: deterministically ordering the clauses) into the Builder.build() methods so that the queries themselves wouldn't have to carry around the extra data structures or do any particularly complex logic in .equals() wouldn't that be the best of both worlds? only do the expensive comparisons / deterministic ordering of the BooleanClauses once in BooleanQuery.build(), and for "expert" cases where you are programatically creating adding clauses in a known deterministic order anyway, you can write you're own custom BooleanQuery.Builder (sub)class?
        Hide
        David Smiley added a comment -

        I wasn't following the interconnection between all of these issues very closely, but i thought Adrien mentioned at one point that one of the motivations/ benefits of the FooQuery.Builder APIs (and making the queries immutable) was that it would allow us to simplify (and make more efficient) some of the methods like .equals and .hashCode by moving the expensive logic (ie: deterministically ordering the clauses) into the Builder.build() methods so that the queries themselves wouldn't have to carry around the extra data structures or do any particularly complex logic in .equals()

        wouldn't that be the best of both worlds? only do the expensive comparisons / deterministic ordering of the BooleanClauses once in BooleanQuery.build(), and for "expert" cases where you are programatically creating adding clauses in a known deterministic order anyway, you can write you're own custom BooleanQuery.Builder (sub)class?

        +1 Sounds great to me – best of both worlds indeed.

        Show
        David Smiley added a comment - I wasn't following the interconnection between all of these issues very closely, but i thought Adrien mentioned at one point that one of the motivations/ benefits of the FooQuery.Builder APIs (and making the queries immutable) was that it would allow us to simplify (and make more efficient) some of the methods like .equals and .hashCode by moving the expensive logic (ie: deterministically ordering the clauses) into the Builder.build() methods so that the queries themselves wouldn't have to carry around the extra data structures or do any particularly complex logic in .equals() wouldn't that be the best of both worlds? only do the expensive comparisons / deterministic ordering of the BooleanClauses once in BooleanQuery.build(), and for "expert" cases where you are programatically creating adding clauses in a known deterministic order anyway, you can write you're own custom BooleanQuery.Builder (sub)class? +1 Sounds great to me – best of both worlds indeed.
        Hide
        Adrien Grand added a comment -

        I don't think we can order queries in a deterministic way in BooleanQuery.Builder given that queries are not comparable? So we need the MultiSet?

        I wrote a simple micro-benchmark and was able to create more than 150k 100-clauses boolean queries per second on a single core, which is much faster than a typical per-core search request throughput. So I don't think we need to make it optional, we can just enable it all the time?

        Show
        Adrien Grand added a comment - I don't think we can order queries in a deterministic way in BooleanQuery.Builder given that queries are not comparable? So we need the MultiSet? I wrote a simple micro-benchmark and was able to create more than 150k 100-clauses boolean queries per second on a single core, which is much faster than a typical per-core search request throughput. So I don't think we need to make it optional, we can just enable it all the time?
        Hide
        Michael McCandless added a comment -

        So I don't think we need to make it optional, we can just enable it all the time?

        +1

        Show
        Michael McCandless added a comment - So I don't think we need to make it optional, we can just enable it all the time? +1
        Hide
        Ryan Ernst added a comment -

        +1 to enabling all the time

        Show
        Ryan Ernst added a comment - +1 to enabling all the time
        Hide
        Adrien Grand added a comment -

        Are there still objections to commit this patch in light of the latest comments?

        Show
        Adrien Grand added a comment - Are there still objections to commit this patch in light of the latest comments?
        Hide
        Jack Krupansky added a comment -

        No objection, but it would be good for the javadoc for BQ and BQ.Builder to explicitly state the contract that the order that clauses are added will not impact either the results of the query, their order, or the performance of the execution of the query - assuming those facts are all true.

        Show
        Jack Krupansky added a comment - No objection, but it would be good for the javadoc for BQ and BQ.Builder to explicitly state the contract that the order that clauses are added will not impact either the results of the query, their order, or the performance of the execution of the query - assuming those facts are all true.
        Hide
        Adrien Grand added a comment -

        Good call Jack.

        the order that clauses are added will not impact either the results of the query, their order,

        This is true.

        or the performance of the execution of the query

        This is mostly true today and should be completely true once LUCENE-6276 is in.

        Here is a new patch with improved javadocs. I also improved equals to ignore duplicates for FILTER and MUST_NOT clauses since they don't matter, and made the hashCode() cached.

        Show
        Adrien Grand added a comment - Good call Jack. the order that clauses are added will not impact either the results of the query, their order, This is true. or the performance of the execution of the query This is mostly true today and should be completely true once LUCENE-6276 is in. Here is a new patch with improved javadocs. I also improved equals to ignore duplicates for FILTER and MUST_NOT clauses since they don't matter, and made the hashCode() cached.
        Hide
        ASF subversion and git services added a comment -

        Commit 1708211 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1708211 ]

        LUCENE-6305: BooleanQuery.equals/hashcode ignore clause order.

        Show
        ASF subversion and git services added a comment - Commit 1708211 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1708211 ] LUCENE-6305 : BooleanQuery.equals/hashcode ignore clause order.
        Hide
        ASF subversion and git services added a comment -

        Commit 1708244 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1708244 ]

        LUCENE-6305: BooleanQuery.equals/hashcode ignore clause order.

        Show
        ASF subversion and git services added a comment - Commit 1708244 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1708244 ] LUCENE-6305 : BooleanQuery.equals/hashcode ignore clause order.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development