Details

    • Type: Wish Wish
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Follow-up from LUCENE-6570: the fact that all queries are mutable in order to allow for applying a boost raises issues since it makes queries bad cache keys since their hashcode can change anytime. We could just document that queries should never be modified after they have gone through IndexSearcher but it would be even better if the API made queries impossible to mutate at all.

      I think there are two main options:

      • either replace "void setBoost(boost)" with something like "Query withBoost(boost)" which would return a clone that has a different boost
      • or move boost handling outside of Query, for instance we could have a (immutable) query impl that would be dedicated to applying boosts, that queries that need to change boosts at rewrite time (such as BooleanQuery) would use as a wrapper.

      The latter idea is from Robert and I like it a lot given how often I either introduced or found a bug which was due to the boost parameter being ignored. Maybe there are other options, but I think this is worth exploring.

      1. LUCENE-6590.patch
        379 kB
        Adrien Grand
      2. LUCENE-6590.patch
        378 kB
        Adrien Grand
      3. LUCENE-6590.patch
        370 kB
        Adrien Grand
      4. LUCENE-6590.patch
        366 kB
        Adrien Grand
      5. LUCENE-6590.patch
        365 kB
        Adrien Grand
      6. LUCENE-6590.patch
        151 kB
        Adrien Grand
      7. LUCENE-6590.patch
        138 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          Yonik Seeley added a comment -

          Removing boost from Query and having a dedicated BoostQuery makes the most sense if one were starting from scratch.
          We're not starting from scratch though (messing with boost would change a lot of code), so I'm not sure where that leaves us.

          Show
          Yonik Seeley added a comment - Removing boost from Query and having a dedicated BoostQuery makes the most sense if one were starting from scratch. We're not starting from scratch though (messing with boost would change a lot of code), so I'm not sure where that leaves us.
          Hide
          Robert Muir added a comment -

          If we just removed Query.get/setBoost and replaced with something like BoostQuery(Query q, float boost) then it could simplify some APIs in scoring as well.

          Today we already have to handle the case of 2 different boosts (boosts coming from parent queries in Weight.normalize, and also the leaf's own Query.boost). It makes things confusing in Weight/Similarity.normalize() where they must deal with these 2 separate boosts and combine them to one.

          So this could all be removed and reduced to boosts handled via one codepath in Weight:

            /** Assigns the query normalization factor and boost to this. */
            public abstract void normalize(float norm, float boost);
          

          That is in addition to reducing the possibility of bugs like forgetting to use boost higher up the stack, which I think is very common.

          Show
          Robert Muir added a comment - If we just removed Query.get/setBoost and replaced with something like BoostQuery(Query q, float boost) then it could simplify some APIs in scoring as well. Today we already have to handle the case of 2 different boosts (boosts coming from parent queries in Weight.normalize, and also the leaf's own Query.boost). It makes things confusing in Weight/Similarity.normalize() where they must deal with these 2 separate boosts and combine them to one. So this could all be removed and reduced to boosts handled via one codepath in Weight: /** Assigns the query normalization factor and boost to this . */ public abstract void normalize( float norm, float boost); That is in addition to reducing the possibility of bugs like forgetting to use boost higher up the stack, which I think is very common.
          Hide
          Michael McCandless added a comment -

          +1 to relegate boosting to a new BoostQuery, nomatter the pain!

          Show
          Michael McCandless added a comment - +1 to relegate boosting to a new BoostQuery, nomatter the pain!
          Hide
          David Smiley added a comment -

          RE BoostQuery

          +1 what a great idea Rob! I too have found bugs related to improper boost handling from time to time (just yesterday in fact). Heh, lets face it; every one of us have. It's so easy to overlook. I get that we're not starting from scratch (Yonik's point) but I'd hate to see such reasoning blocking Lucene from being the Lucene we want it to be. Pragmatically that means, I think, doing this in 6.0 and having a transition to this from 5.0. It's a pain but if someone is stepping up to do the work then great.

          Show
          David Smiley added a comment - RE BoostQuery +1 what a great idea Rob! I too have found bugs related to improper boost handling from time to time (just yesterday in fact). Heh, lets face it; every one of us have. It's so easy to overlook. I get that we're not starting from scratch (Yonik's point) but I'd hate to see such reasoning blocking Lucene from being the Lucene we want it to be. Pragmatically that means, I think, doing this in 6.0 and having a transition to this from 5.0. It's a pain but if someone is stepping up to do the work then great.
          Hide
          Adrien Grand added a comment -

          I started playing with this idea and here is a first step (only lucene/core and would deserve better docs but tests are passing) if someone would like to have a look. Clone, setBoost and getBoost are removed from the Query API in favour of a new BoostQuery, and query boosts are propagated through Query.createWeight. Next step is to figure out if we can remove this queryBoost parameter from createWeight (to get back to today's trunk signature) and handle everything through Weight.normalize like Robert suggested.

          Show
          Adrien Grand added a comment - I started playing with this idea and here is a first step (only lucene/core and would deserve better docs but tests are passing) if someone would like to have a look. Clone, setBoost and getBoost are removed from the Query API in favour of a new BoostQuery, and query boosts are propagated through Query.createWeight. Next step is to figure out if we can remove this queryBoost parameter from createWeight (to get back to today's trunk signature) and handle everything through Weight.normalize like Robert suggested.
          Hide
          Adrien Grand added a comment -

          New iteration that removes queryBoost from createWeight and applies both the query and the parent boost through Weight.normalize. Here is how it works:

          1. Weight.normalize is called a query norm of 1 and the query boost
          2. Weight.getValueForNormalization (which may take the query boost into account) is called to compute queryNorm
          3. Weight.normalize is called again with queryNorm and a boost which is equal to the query boost multiplied by the boost which is propagated by parent queries.

          The patch still only touches lucene/core. There is some documentation that needs to be updated, etc. but if someone could look at the API and how boosts are applied, that would be great. When we are happy with the approach then I can spend time fixing modules as well.

          Show
          Adrien Grand added a comment - New iteration that removes queryBoost from createWeight and applies both the query and the parent boost through Weight.normalize. Here is how it works: 1. Weight.normalize is called a query norm of 1 and the query boost 2. Weight.getValueForNormalization (which may take the query boost into account) is called to compute queryNorm 3. Weight.normalize is called again with queryNorm and a boost which is equal to the query boost multiplied by the boost which is propagated by parent queries. The patch still only touches lucene/core. There is some documentation that needs to be updated, etc. but if someone could look at the API and how boosts are applied, that would be great. When we are happy with the approach then I can spend time fixing modules as well.
          Hide
          Adrien Grand added a comment -

          Here is a patch that iterates over the previous one but for the entire code base. Overall, I think I like this change for several reasons:

          • Query rewriting is much easier as you don't need to care about boosts (I'm pretty sure the patch fixes some boost handling)
          • Queries don't need to incorporate the boost in their toString() representation
          • it removed the queryBoost parameter from Similarities and the top level boost and query boost are now applied the same way
          • it removed several APIs: Query.setBoost, Query.getBoost, Query.clone without adding new ones, we just have two additional query implementations: BoostQuery and SpanBoostQuery
          • It allows queries to be fully immutable, ie. we don't need to clone anymore when planning to use a Query as a cache key.

          The only downside I see is for the highlighting module: it has to deconstruct queries to understand how they are formed, and this new additional BoostQuery required to add a bit more unwrapping. However, I think the benefits outweight this slight inconvenience.

          Any opinions?

          Show
          Adrien Grand added a comment - Here is a patch that iterates over the previous one but for the entire code base. Overall, I think I like this change for several reasons: Query rewriting is much easier as you don't need to care about boosts (I'm pretty sure the patch fixes some boost handling) Queries don't need to incorporate the boost in their toString() representation it removed the queryBoost parameter from Similarities and the top level boost and query boost are now applied the same way it removed several APIs: Query.setBoost, Query.getBoost, Query.clone without adding new ones, we just have two additional query implementations: BoostQuery and SpanBoostQuery It allows queries to be fully immutable, ie. we don't need to clone anymore when planning to use a Query as a cache key. The only downside I see is for the highlighting module: it has to deconstruct queries to understand how they are formed, and this new additional BoostQuery required to add a bit more unwrapping. However, I think the benefits outweight this slight inconvenience. Any opinions?
          Hide
          Terry Smith added a comment -

          I think this looks great and will certainly make the boost handling more robust in my custom queries. Especially looking forward to fully immutable queries.

          What do you think is possible in terms of updating 5.x to make the transition easier?

          Show
          Terry Smith added a comment - I think this looks great and will certainly make the boost handling more robust in my custom queries. Especially looking forward to fully immutable queries. What do you think is possible in terms of updating 5.x to make the transition easier?
          Hide
          Yonik Seeley added a comment -

          Another thought... boosts really only make sense for boosting one clause over another.
          so another approach that might be less invasive (and won't break "instanceof" checks and current unwrapping code) is to just add boost to BooleanClause.

          In any case, it's certainly great to remove all the boost related stuff from most queries that don't care about it.

          Show
          Yonik Seeley added a comment - Another thought... boosts really only make sense for boosting one clause over another. so another approach that might be less invasive (and won't break "instanceof" checks and current unwrapping code) is to just add boost to BooleanClause. In any case, it's certainly great to remove all the boost related stuff from most queries that don't care about it.
          Hide
          Paul Elschot added a comment -

          This might have some influence on weighting in the surround (span) queries, but that can be fixed later if it is needed at all.

          Show
          Paul Elschot added a comment - This might have some influence on weighting in the surround (span) queries, but that can be fixed later if it is needed at all.
          Hide
          Adrien Grand added a comment -

          What do you think is possible in terms of updating 5.x to make the transition easier?

          My plan was to backport the change, make Query.clone/setBoost/getBoost deprecated, change the default rewrite implementation to something like:

            public Query rewrite(IndexReader reader) throws IOException {
              if (boost != 1f) {
                Query rewritten = clone();
                rewritten.setBoost(1f);
                rewritten = new BoostQuery(rewritten, boost);
                return rewritten;
              }
              return this;
            }
          

          and then review our existing queries to make sure they always end up delegating to super.rewrite (I know some of them, like DisjunctionMaxQuery, end up just returning "this", so that would have to be fixed to return super.rewrite() instead.)
          I think that should work?

          Another thought... boosts really only make sense for boosting one clause over another. so another approach that might be less invasive (and won't break "instanceof" checks and current unwrapping code) is to just add boost to BooleanClause.

          That's an interesting idea but I think this only applies to DefaultSimilarity? Other Similarities tend to handle boosts as multiplicative factors to the scores (because they return 1 in queryNorm instead of 1/sqrt(sumOfSquaredWeights))? Also this might be an issue for DisjunctionMaxQuery which could not have different boosts per sub query anymore?

          This might have some influence on weighting in the surround (span) queries, but that can be fixed later if it is needed at all.

          I'm curious what issues you are foreseeing? (I'm not very familiar with surround queries...)

          Show
          Adrien Grand added a comment - What do you think is possible in terms of updating 5.x to make the transition easier? My plan was to backport the change, make Query.clone/setBoost/getBoost deprecated, change the default rewrite implementation to something like: public Query rewrite(IndexReader reader) throws IOException { if (boost != 1f) { Query rewritten = clone(); rewritten.setBoost(1f); rewritten = new BoostQuery(rewritten, boost); return rewritten; } return this ; } and then review our existing queries to make sure they always end up delegating to super.rewrite (I know some of them, like DisjunctionMaxQuery, end up just returning "this", so that would have to be fixed to return super.rewrite() instead.) I think that should work? Another thought... boosts really only make sense for boosting one clause over another. so another approach that might be less invasive (and won't break "instanceof" checks and current unwrapping code) is to just add boost to BooleanClause. That's an interesting idea but I think this only applies to DefaultSimilarity? Other Similarities tend to handle boosts as multiplicative factors to the scores (because they return 1 in queryNorm instead of 1/sqrt(sumOfSquaredWeights))? Also this might be an issue for DisjunctionMaxQuery which could not have different boosts per sub query anymore? This might have some influence on weighting in the surround (span) queries, but that can be fixed later if it is needed at all. I'm curious what issues you are foreseeing? (I'm not very familiar with surround queries...)
          Hide
          Yonik Seeley added a comment -

          That's an interesting idea but I think this only applies to DefaultSimilarity? Other Similarities tend to handle boosts as multiplicative factors to the scores (because they return 1 in queryNorm instead of 1/sqrt(sumOfSquaredWeights))?

          If you're talking about a single query being boosted? That edge case could be handled via a boolean query with a single clause (with that clause holding the boost).

          Also this might be an issue for DisjunctionMaxQuery which could not have different boosts per sub query anymore?

          Yeah, that would also need clauses (and boosts on those clauses.)
          Just throwing the idea out there... you're the one doing the work!

          Show
          Yonik Seeley added a comment - That's an interesting idea but I think this only applies to DefaultSimilarity? Other Similarities tend to handle boosts as multiplicative factors to the scores (because they return 1 in queryNorm instead of 1/sqrt(sumOfSquaredWeights))? If you're talking about a single query being boosted? That edge case could be handled via a boolean query with a single clause (with that clause holding the boost). Also this might be an issue for DisjunctionMaxQuery which could not have different boosts per sub query anymore? Yeah, that would also need clauses (and boosts on those clauses.) Just throwing the idea out there... you're the one doing the work!
          Hide
          Paul Elschot added a comment -

          The surround query language allows to add a boost value to sub queries of OR and of proximity. For OR boolean queries, the given boosts are passed to the generated boolean query clauses and they are used there as expected. For proximity and for OR within proximity, these boosts are passed to the generated span queries, but lucene ignores these boosts at the moment so there is no influence from the changes here.

          This weighting is done in the SrndQuery.makeLuceneQueryField method which is nicely patched here.
          However there are no tests for expected score values of surround queries.

          Show
          Paul Elschot added a comment - The surround query language allows to add a boost value to sub queries of OR and of proximity. For OR boolean queries, the given boosts are passed to the generated boolean query clauses and they are used there as expected. For proximity and for OR within proximity, these boosts are passed to the generated span queries, but lucene ignores these boosts at the moment so there is no influence from the changes here. This weighting is done in the SrndQuery.makeLuceneQueryField method which is nicely patched here. However there are no tests for expected score values of surround queries.
          Hide
          Paul Elschot added a comment -

          I overlooked that the patch also removes setting the boost of the span query generated from the surround sub query, see SpanNearClauseFactory in the patch. For now this will not have any influence because span query boosts in sub queries are ignored anyway.

          Show
          Paul Elschot added a comment - I overlooked that the patch also removes setting the boost of the span query generated from the surround sub query, see SpanNearClauseFactory in the patch. For now this will not have any influence because span query boosts in sub queries are ignored anyway.
          Hide
          Adrien Grand added a comment -

          Good catch Paul, I added this logic back.

          Show
          Adrien Grand added a comment - Good catch Paul, I added this logic back.
          Hide
          Paul Elschot added a comment -

          The new SpanBoostQuery, thanks

          Show
          Paul Elschot added a comment - The new SpanBoostQuery, thanks
          Hide
          Adrien Grand added a comment -

          Sorry Paul, I don't understand your last comment?

          Show
          Adrien Grand added a comment - Sorry Paul, I don't understand your last comment?
          Hide
          Adrien Grand added a comment -

          Iterated on the patch to make our queries return `super.rewrite` instead of `this` when they exhausted their rewrite rules to prepare for the backport. This way we could nicely handle backward compatibility by fixing the base Query.rewrite to return a BoostQuery around itself when the boost is not 1.

          Show
          Adrien Grand added a comment - Iterated on the patch to make our queries return `super.rewrite` instead of `this` when they exhausted their rewrite rules to prepare for the backport. This way we could nicely handle backward compatibility by fixing the base Query.rewrite to return a BoostQuery around itself when the boost is not 1.
          Hide
          Adrien Grand added a comment -

          Are there any objections to committing this?

          Show
          Adrien Grand added a comment - Are there any objections to committing this?
          Hide
          Paul Elschot added a comment -

          Quite the contrary.

          I meant that the patch now uses SpanBoostQuery to contain any weights given to the surround proximity subqueries.

          Show
          Paul Elschot added a comment - Quite the contrary. I meant that the patch now uses SpanBoostQuery to contain any weights given to the surround proximity subqueries.
          Hide
          Adrien Grand added a comment -

          Patch updated to trunk. The patch is large and a bit painful to maintain so if you would like to review it, please let me know. Otherwise I will commit soon.

          Show
          Adrien Grand added a comment - Patch updated to trunk. The patch is large and a bit painful to maintain so if you would like to review it, please let me know. Otherwise I will commit soon.
          Hide
          Paul Elschot added a comment -

          LGTM.

          Show
          Paul Elschot added a comment - LGTM.
          Hide
          Alan Woodward added a comment -

          This looks great.

          I think there's a bug in BoostQuery.rewrite(). If the boost is set to 1, then rewrite() just returns the inner query, without rewriting it. It should return query.rewrite() instead.

          Show
          Alan Woodward added a comment - This looks great. I think there's a bug in BoostQuery.rewrite(). If the boost is set to 1, then rewrite() just returns the inner query, without rewriting it. It should return query.rewrite() instead.
          Hide
          Uwe Schindler added a comment -

          Rewrite works iterative until the query no longer rewrites. See IndexSearcher's rewrite/create weight for the corresponding while loop.
          So this still works, it just needs one iteration more.

          Show
          Uwe Schindler added a comment - Rewrite works iterative until the query no longer rewrites. See IndexSearcher's rewrite/create weight for the corresponding while loop. So this still works, it just needs one iteration more.
          Hide
          Yonik Seeley added a comment -

          Rewrite works iterative until the query no longer rewrites. See IndexSearcher's rewrite/create weight for the corresponding while loop. So this still works, it just needs one iteration more.

          Which causes a clone of the entire query tree again? Seems like in general we should try to minimize that.

          Show
          Yonik Seeley added a comment - Rewrite works iterative until the query no longer rewrites. See IndexSearcher's rewrite/create weight for the corresponding while loop. So this still works, it just needs one iteration more. Which causes a clone of the entire query tree again? Seems like in general we should try to minimize that.
          Hide
          Adrien Grand added a comment -

          I don't mind refactoring BoostQuery.rewrite so that query rewriting converges a bit faster, here is an updated patch.

          Show
          Adrien Grand added a comment - I don't mind refactoring BoostQuery.rewrite so that query rewriting converges a bit faster, here is an updated patch.
          Hide
          ASF subversion and git services added a comment -

          Commit 1701621 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1701621 ]

          LUCENE-6590: Replace Query.getBoost, setBoost and clone with a new BoostQuery.

          Show
          ASF subversion and git services added a comment - Commit 1701621 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1701621 ] LUCENE-6590 : Replace Query.getBoost, setBoost and clone with a new BoostQuery.
          Hide
          ASF subversion and git services added a comment -

          Commit 1701742 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1701742 ]

          LUCENE-6590: Fix BooleanQuery to not propagate query boosts twice.

          Show
          ASF subversion and git services added a comment - Commit 1701742 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1701742 ] LUCENE-6590 : Fix BooleanQuery to not propagate query boosts twice.
          Hide
          ASF subversion and git services added a comment -

          Commit 1701783 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1701783 ]

          LUCENE-6590,LUCENE-6783: Replace Query.getBoost, setBoost and clone with a new BoostQuery.

          Show
          ASF subversion and git services added a comment - Commit 1701783 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1701783 ] LUCENE-6590 , LUCENE-6783 : Replace Query.getBoost, setBoost and clone with a new BoostQuery.
          Hide
          ASF subversion and git services added a comment -

          Commit 1702042 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1702042 ]

          LUCENE-6590: Add back ToStringUtils.boost, which should not have been removed while merging.

          Show
          ASF subversion and git services added a comment - Commit 1702042 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1702042 ] LUCENE-6590 : Add back ToStringUtils.boost, which should not have been removed while merging.
          Hide
          ASF subversion and git services added a comment -

          Commit 1702088 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1702088 ]

          LUCENE-6590: Fix boost handling.

          Show
          ASF subversion and git services added a comment - Commit 1702088 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1702088 ] LUCENE-6590 : Fix boost handling.
          Hide
          ASF subversion and git services added a comment -

          Commit 1702090 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1702090 ]

          LUCENE-6590: More missing calls to ToStringUtils.boost.

          Show
          ASF subversion and git services added a comment - Commit 1702090 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1702090 ] LUCENE-6590 : More missing calls to ToStringUtils.boost.
          Hide
          ASF subversion and git services added a comment -

          Commit 1702263 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1702263 ]

          LUCENE-6590: Make sure the fast-vector-highlighter also handles boosts set via Query.setBoost.

          Show
          ASF subversion and git services added a comment - Commit 1702263 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1702263 ] LUCENE-6590 : Make sure the fast-vector-highlighter also handles boosts set via Query.setBoost.
          Hide
          Terry Smith added a comment -

          Adrien Grand: PhraseQuery is missing a call to ToStringUtils.boost in it's toString method on the 5.x branch.

          Show
          Terry Smith added a comment - Adrien Grand : PhraseQuery is missing a call to ToStringUtils.boost in it's toString method on the 5.x branch.
          Hide
          Terry Smith added a comment -

          Hmm, so is NumericRangeQuery.

          Show
          Terry Smith added a comment - Hmm, so is NumericRangeQuery.
          Hide
          Terry Smith added a comment -

          Also FunctionQuery.

          Show
          Terry Smith added a comment - Also FunctionQuery.
          Hide
          Adrien Grand added a comment -

          Thanks Terry, I'll fix.

          Show
          Adrien Grand added a comment - Thanks Terry, I'll fix.
          Hide
          Terry Smith added a comment - - edited

          Cheers Adrien. Sorry for the spammy replies before – I wasn't expecting to see more than one discrepancy!

          While you are looking at the Query.toString() behavior with respect to boosting, how would you feel about adding MatchAllDocsQuery.class to BoostQuery.NO_PARENS_REQUIRED_QUERIES so it's toString() doesn't change across releases?

          Query q = new MatchAllDocsQuery();
          q.setBoost(0);
          q.toString() -> *:*^0.0
          
          new BoostQuery(new MatchAllDocsQuery(), 0).toString() -> (*:*)^0.0
          
          Show
          Terry Smith added a comment - - edited Cheers Adrien. Sorry for the spammy replies before – I wasn't expecting to see more than one discrepancy! While you are looking at the Query.toString() behavior with respect to boosting, how would you feel about adding MatchAllDocsQuery.class to BoostQuery.NO_PARENS_REQUIRED_QUERIES so it's toString() doesn't change across releases? Query q = new MatchAllDocsQuery(); q.setBoost(0); q.toString() -> *:*^0.0 new BoostQuery(new MatchAllDocsQuery(), 0).toString() -> (*:*)^0.0
          Hide
          ASF subversion and git services added a comment -

          Commit 1706827 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1706827 ]

          LUCENE-6590: Fix toString() representations of queries to include the boost.

          Some of them were lost in the merge, others were just missing.

          Show
          ASF subversion and git services added a comment - Commit 1706827 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1706827 ] LUCENE-6590 : Fix toString() representations of queries to include the boost. Some of them were lost in the merge, others were just missing.
          Hide
          Terry Smith added a comment -

          Thanks Adrien.

          My nightly regressions just picked this up from the published maven snapshots and I see that BoostQuery now includes MatchAllDocsQuery in it's NO_PARENS_REQUIRED_QUERIES list on branch_5x (this is awesome!). However this change is not available on trunk.

          I've confirmed by checked the SVN repo directly (the github mirror tends to lag).

          http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BoostQuery.java

          http://svn.apache.org/repos/asf/lucene/dev/branches/branch_5x/lucene/core/src/java/org/apache/lucene/search/BoostQuery.java

          --Terry

          Show
          Terry Smith added a comment - Thanks Adrien. My nightly regressions just picked this up from the published maven snapshots and I see that BoostQuery now includes MatchAllDocsQuery in it's NO_PARENS_REQUIRED_QUERIES list on branch_5x (this is awesome!). However this change is not available on trunk. I've confirmed by checked the SVN repo directly (the github mirror tends to lag). http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BoostQuery.java http://svn.apache.org/repos/asf/lucene/dev/branches/branch_5x/lucene/core/src/java/org/apache/lucene/search/BoostQuery.java --Terry
          Hide
          Adrien Grand added a comment -

          Thanks for confirming it works for you now. Regarding trunk, I think we should remove the hack entirely instead of adding more exceptions to this hack? I opened LUCENE-6834.

          Show
          Adrien Grand added a comment - Thanks for confirming it works for you now. Regarding trunk, I think we should remove the hack entirely instead of adding more exceptions to this hack? I opened LUCENE-6834 .
          Hide
          ASF subversion and git services added a comment -

          Commit 1722478 from Ramkumar Aiyengar in branch 'dev/trunk'
          [ https://svn.apache.org/r1722478 ]

          SOLR-8418: Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and Simple/CloudMLTQParser

          Show
          ASF subversion and git services added a comment - Commit 1722478 from Ramkumar Aiyengar in branch 'dev/trunk' [ https://svn.apache.org/r1722478 ] SOLR-8418 : Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and Simple/CloudMLTQParser
          Hide
          ASF subversion and git services added a comment -

          Commit 1722488 from Ramkumar Aiyengar in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1722488 ]

          SOLR-8418: Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and Simple/CloudMLTQParser

          Show
          ASF subversion and git services added a comment - Commit 1722488 from Ramkumar Aiyengar in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1722488 ] SOLR-8418 : Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and Simple/CloudMLTQParser
          Hide
          ASF subversion and git services added a comment -

          Commit 1725144 from Ramkumar Aiyengar in branch 'dev/branches/lucene_solr_5_4'
          [ https://svn.apache.org/r1725144 ]

          SOLR-8418: Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and Simple/CloudMLTQParser

          Show
          ASF subversion and git services added a comment - Commit 1725144 from Ramkumar Aiyengar in branch 'dev/branches/lucene_solr_5_4' [ https://svn.apache.org/r1725144 ] SOLR-8418 : Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and Simple/CloudMLTQParser
          Hide
          ASF subversion and git services added a comment -

          Commit 340dc9ca5039244b2a78e284dd707e4466b3f3d4 in lucene-solr's branch refs/heads/branch_5_4 from Ramkumar Aiyengar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=340dc9c ]

          SOLR-8418: Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and Simple/CloudMLTQParser

          git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_4@1725144 13f79535-47bb-0310-9956-ffa450edef68

          Show
          ASF subversion and git services added a comment - Commit 340dc9ca5039244b2a78e284dd707e4466b3f3d4 in lucene-solr's branch refs/heads/branch_5_4 from Ramkumar Aiyengar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=340dc9c ] SOLR-8418 : Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and Simple/CloudMLTQParser git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_4@1725144 13f79535-47bb-0310-9956-ffa450edef68
          Hide
          Upayavira added a comment -

          Adrien Grand this commit made this change:
          {{
          @Override

          • public void normalize(float queryNorm, float topLevelBoost) {
          • this.queryNorm = queryNorm * topLevelBoost;
          • queryWeight *= this.queryNorm; // normalize query weight
            + public void normalize(float queryNorm, float boost) { + this.boost = boost; + this.queryNorm = queryNorm; + queryWeight = queryNorm * boost * idf.getValue(); value = queryWeight * idf.getValue(); // idf for document }

            }}

          Looking at this, before, the queryWeight was just the queryNorm* boost. Now, it is qeueryNorm* boost * IDF.

          This means I'm seeing scores which instead of being TF * IDF are now TF * IDF * IDF.

          Was this an intentional change?

          Show
          Upayavira added a comment - Adrien Grand this commit made this change: {{ @Override public void normalize(float queryNorm, float topLevelBoost) { this.queryNorm = queryNorm * topLevelBoost; queryWeight *= this.queryNorm; // normalize query weight + public void normalize(float queryNorm, float boost) { + this.boost = boost; + this.queryNorm = queryNorm; + queryWeight = queryNorm * boost * idf.getValue(); value = queryWeight * idf.getValue(); // idf for document } }} Looking at this, before, the queryWeight was just the queryNorm* boost. Now, it is qeueryNorm* boost * IDF. This means I'm seeing scores which instead of being TF * IDF are now TF * IDF * IDF. Was this an intentional change?
          Hide
          Adrien Grand added a comment -

          The code was reorganized a bit, but it should work the same way as before. In short, one of the idf factors should be counter balanced by the query norm. Here is what happens in sequence for a term query with a boost of 1 (default):

          • upon initialization, IDFStats calls normalize(1,1), which cause queryWeight to be equal to the idf
          • then IndexSearcher calls getValueForNormalization, which return queryWeight*queryWeight=idf^2
          • then IndexSearcher calls norm=ClassicSimilarity.queryNorm(valueForNormalization) which is 1/sqrt(valueForNormalization) = 1/idf
          • then IndexSearcher calls normalize(norm, 1), which gives value = norm * 1 * idf * idf = 1/idf * 1 * idf * idf = idf

          Maybe your code forgot to normalize based on the value of getValueForNormalization somehow?

          Show
          Adrien Grand added a comment - The code was reorganized a bit, but it should work the same way as before. In short, one of the idf factors should be counter balanced by the query norm. Here is what happens in sequence for a term query with a boost of 1 (default): upon initialization, IDFStats calls normalize(1,1), which cause queryWeight to be equal to the idf then IndexSearcher calls getValueForNormalization, which return queryWeight*queryWeight=idf^2 then IndexSearcher calls norm=ClassicSimilarity.queryNorm(valueForNormalization) which is 1/sqrt(valueForNormalization) = 1/idf then IndexSearcher calls normalize(norm, 1), which gives value = norm * 1 * idf * idf = 1/idf * 1 * idf * idf = idf Maybe your code forgot to normalize based on the value of getValueForNormalization somehow?
          Hide
          Upayavira added a comment -

          My code in this case is just Solr, so it seems that in this change, the "normalize based on the value of getvalueForNormalization" wasn't done in Solr.

          I've looked briefly at the SolrIndexSearcher, which subclasses IndexSearcher, but it doesn't seem to mess with normalization. Any chance you could take a look at SolrIndexSearcher and see if something was missed there? I'm definitely seeing scores which are tf*tf*idf in Solr results.

          Show
          Upayavira added a comment - My code in this case is just Solr, so it seems that in this change, the "normalize based on the value of getvalueForNormalization" wasn't done in Solr. I've looked briefly at the SolrIndexSearcher, which subclasses IndexSearcher, but it doesn't seem to mess with normalization. Any chance you could take a look at SolrIndexSearcher and see if something was missed there? I'm definitely seeing scores which are tf*tf*idf in Solr results.
          Hide
          Adrien Grand added a comment -

          Sure I can have a look. Do you have the similarity explicitly set in your schema? If not, what is the Lucene match version of your schema? (Solr picks the default similarity based on this version.)

          Show
          Adrien Grand added a comment - Sure I can have a look. Do you have the similarity explicitly set in your schema? If not, what is the Lucene match version of your schema? (Solr picks the default similarity based on this version.)
          Hide
          Adrien Grand added a comment -

          Maybe you can write a simple test case so that I can dig into it?

          Show
          Adrien Grand added a comment - Maybe you can write a simple test case so that I can dig into it?
          Hide
          Upayavira added a comment -

          My schema explicitly specifies the ClassicSimilarity. My Lucene match version was 4.6. I increased it to 5.5 and the behaviour was the same (this is using a Solr 5.5 system).

          I could try and knock up a test case, but to-date I've avoided coding in Java on Solr/Lucene, so not yet familiar with the various test frameworks.

          Show
          Upayavira added a comment - My schema explicitly specifies the ClassicSimilarity. My Lucene match version was 4.6. I increased it to 5.5 and the behaviour was the same (this is using a Solr 5.5 system). I could try and knock up a test case, but to-date I've avoided coding in Java on Solr/Lucene, so not yet familiar with the various test frameworks.
          Hide
          Adrien Grand added a comment -

          I just downloaded Solr 5.5 and indexed a couple documents with a "foo" field, then ran a query on "foo:bar", the explanation shows that the idf is taken into account only once:

          1.4519851 = weight(foo:bar in 0) [ClassicSimilarity], result of:
            1.4519851 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              1.4519851 = idf(docFreq=6, maxDocs=11)
              1.0 = fieldNorm(doc=0)
          
          Show
          Adrien Grand added a comment - I just downloaded Solr 5.5 and indexed a couple documents with a "foo" field, then ran a query on "foo:bar", the explanation shows that the idf is taken into account only once: 1.4519851 = weight(foo:bar in 0) [ClassicSimilarity], result of: 1.4519851 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 1.4519851 = idf(docFreq=6, maxDocs=11) 1.0 = fieldNorm(doc=0)
          Hide
          Adrien Grand added a comment -

          What does the explanation look like in your case?

          Show
          Adrien Grand added a comment - What does the explanation look like in your case?
          Hide
          Upayavira added a comment -

          Here's an example for 4.10 and the same query against 5.5 - note, it is a different doc though:

          4.10 score ========================================================
                "2937439": {
                  "match": true,
                  "value": 5.5993805,
                  "description": "weight(description:obama in 394012)
                  [DefaultSimilarity], result of:",
                  "details": [
                    {
                      "match": true,
                      "value": 5.5993805,
                      "description": "fieldWeight in 394012, product of:",
                      "details": [
                        {
                          "match": true,
                          "value": 1,
                          "description": "tf(freq=1.0), with freq of:",
                          "details": [
                            {
                              "match": true,
                              "value": 1,
                              "description": "termFreq=1.0"
                            }
                          ]
                        },
                        {
                          "match": true,
                          "value": 5.5993805,
                          "description": "idf(docFreq=56010, maxDocs=5568765)"
                        },
                        {
                          "match": true,
                          "value": 1,
                          "description": "fieldNorm(doc=394012)"
                        }
                      ]
                    }
                  ]
          5.5 score ========================================================
                "2502281":{
                  "match":true,
                  "value":28.51136,
                  "description":"weight(description:obama in 43472) [], result
                  of:",
                  "details":[{
                      "match":true,
                      "value":28.51136,
                      "description":"score(doc=43472,freq=1.0), product of:",
                      "details":[{
                          "match":true,
                          "value":5.339603,
                          "description":"queryWeight, product of:",
                          "details":[{
                              "match":true,
                              "value":5.339603,
                              "description":"idf(docFreq=31905,
                              maxDocs=2446459)"},
                            {
                              "match":true,
                              "value":1.0,
                              "description":"queryNorm"}]},
                        {
                          "match":true,
                          "value":5.339603,
                          "description":"fieldWeight in 43472, product of:",
                          "details":[{
                              "match":true,
                              "value":1.0,
                              "description":"tf(freq=1.0), with freq of:",
                              "details":[{
                                  "match":true,
                                  "value":1.0,
                                  "description":"termFreq=1.0"}]},
                            {
                              "match":true,
                              "value":5.339603,
                              "description":"idf(docFreq=31905,
                              maxDocs=2446459)"},
                            {
                              "match":true,
                              "value":1.0,
                              "description":"fieldNorm(doc=43472)"}]}]}]},
          
          Show
          Upayavira added a comment - Here's an example for 4.10 and the same query against 5.5 - note, it is a different doc though: 4.10 score ======================================================== "2937439" : { "match" : true , "value" : 5.5993805, "description" : "weight(description:obama in 394012) [DefaultSimilarity], result of:", "details" : [ { "match" : true , "value" : 5.5993805, "description" : "fieldWeight in 394012, product of:" , "details" : [ { "match" : true , "value" : 1, "description" : "tf(freq=1.0), with freq of:" , "details" : [ { "match" : true , "value" : 1, "description" : "termFreq=1.0" } ] }, { "match" : true , "value" : 5.5993805, "description" : "idf(docFreq=56010, maxDocs=5568765)" }, { "match" : true , "value" : 1, "description" : "fieldNorm(doc=394012)" } ] } ] 5.5 score ======================================================== "2502281" :{ "match" : true , "value" :28.51136, "description" :"weight(description:obama in 43472) [], result of:", "details" :[{ "match" : true , "value" :28.51136, "description" : "score(doc=43472,freq=1.0), product of:" , "details" :[{ "match" : true , "value" :5.339603, "description" : "queryWeight, product of:" , "details" :[{ "match" : true , "value" :5.339603, "description" :"idf(docFreq=31905, maxDocs=2446459)"}, { "match" : true , "value" :1.0, "description" : "queryNorm" }]}, { "match" : true , "value" :5.339603, "description" : "fieldWeight in 43472, product of:" , "details" :[{ "match" : true , "value" :1.0, "description" : "tf(freq=1.0), with freq of:" , "details" :[{ "match" : true , "value" :1.0, "description" : "termFreq=1.0" }]}, { "match" : true , "value" :5.339603, "description" :"idf(docFreq=31905, maxDocs=2446459)"}, { "match" : true , "value" :1.0, "description" : "fieldNorm(doc=43472)" }]}]}]},
          Hide
          Upayavira added a comment -

          Any ideas Adrien Grand

          Show
          Upayavira added a comment - Any ideas Adrien Grand
          Hide
          Adrien Grand added a comment -

          Thanks for the ping, I had missed your previous message. The bug is that queryNorm should not be 1.0 in the 5.5 explanation. There must be something that by-passes query normalization somewhere. I believe your query was a simple term query for description:obama, is it correct? Since I had run something similar and did not reproduce the bug, I believe there must be something specific to your setup that triggers this problem. Could you try to build a reproducible test case so that I can dig what is happening, either an actual test case or a sequence of commands that I can run against Solr to reproduce the problem?

          Show
          Adrien Grand added a comment - Thanks for the ping, I had missed your previous message. The bug is that queryNorm should not be 1.0 in the 5.5 explanation. There must be something that by-passes query normalization somewhere. I believe your query was a simple term query for description:obama, is it correct? Since I had run something similar and did not reproduce the bug, I believe there must be something specific to your setup that triggers this problem. Could you try to build a reproducible test case so that I can dig what is happening, either an actual test case or a sequence of commands that I can run against Solr to reproduce the problem?
          Hide
          Upayavira added a comment -

          It is quite possibly something in my setup. I figured because someone else reported the same issue it might be more global, but I think now it is time for me to assume I've done something stupid. Apologies and thanks.

          Show
          Upayavira added a comment - It is quite possibly something in my setup. I figured because someone else reported the same issue it might be more global, but I think now it is time for me to assume I've done something stupid. Apologies and thanks.
          Hide
          Upayavira added a comment -

          It seems a previous test I did was flawed (I thought I was pushing updated configs, but suspect that I was actually pushing old ones). Scoring is now working correctly, the main change being an update of the Lucene match version from 4.6 to 5.5.

          Show
          Upayavira added a comment - It seems a previous test I did was flawed (I thought I was pushing updated configs, but suspect that I was actually pushing old ones). Scoring is now working correctly, the main change being an update of the Lucene match version from 4.6 to 5.5.
          Hide
          Adrien Grand added a comment -

          I tried dowgrading the lucene match version to 4.6 on my local Solr installation but this does not help reproduce the problem. I am still interested in getting to the bottom of this, especially if other users are hitting the same problem, so if you manage to narrow it down to some specific configuration changes that would be helpful.

          Show
          Adrien Grand added a comment - I tried dowgrading the lucene match version to 4.6 on my local Solr installation but this does not help reproduce the problem. I am still interested in getting to the bottom of this, especially if other users are hitting the same problem, so if you manage to narrow it down to some specific configuration changes that would be helpful.
          Hide
          Upayavira added a comment -

          I upgraded my Solr configs to match the 5.5, and the issue was still there. I eventually tracked down this snippet in Solr's SchemaSimilarityFactory.java:

          {{

          • <p>
          • <b>NOTE:</b> Users should be aware that even when this factory uses a single default
          • <code>Similarity</code> for some or all fields in a Query, the behavior can be inconsistent
          • with the behavior of explicitly configuring that same <code>Similarity</code> globally, because
          • of differences in how some multi-field / multi-clause behavior is defined in
          • <code>PerFieldSimilarityWrapper</code>. In particular please consider carefully the documentation
          • & implementation of {@link Similarity#coord}

            and

            {@link Similarity#queryNorm}

            in

          • {@link ClassicSimilarity}

            compared to

            {@link PerFieldSimilarityWrapper}
          • </p>
            }}

          which suggests that adding a SchemaSimilarityFactory will change the scoring, even if you continue to use the ClassicSimilarityFactory.

          Show
          Upayavira added a comment - I upgraded my Solr configs to match the 5.5, and the issue was still there. I eventually tracked down this snippet in Solr's SchemaSimilarityFactory.java: {{ <p> <b>NOTE:</b> Users should be aware that even when this factory uses a single default <code>Similarity</code> for some or all fields in a Query, the behavior can be inconsistent with the behavior of explicitly configuring that same <code>Similarity</code> globally, because of differences in how some multi-field / multi-clause behavior is defined in <code>PerFieldSimilarityWrapper</code>. In particular please consider carefully the documentation & implementation of {@link Similarity#coord} and {@link Similarity#queryNorm} in {@link ClassicSimilarity} compared to {@link PerFieldSimilarityWrapper} </p> }} which suggests that adding a SchemaSimilarityFactory will change the scoring, even if you continue to use the ClassicSimilarityFactory.
          Hide
          Adrien Grand added a comment -

          Upayavira I see... the normalization factor is computed by the per-field similarity but then it is the responsibility of the PerFieldSimilarityWrapper to normalize and the default implementation does not normalize. I opened SOLR-9315.

          Show
          Adrien Grand added a comment - Upayavira I see... the normalization factor is computed by the per-field similarity but then it is the responsibility of the PerFieldSimilarityWrapper to normalize and the default implementation does not normalize. I opened SOLR-9315 .
          Hide
          Upayavira added a comment -

          I applied your patch and my problems went away. Many thanks!!

          Show
          Upayavira added a comment - I applied your patch and my problems went away. Many thanks!!

            People

            • Assignee:
              Unassigned
              Reporter:
              Adrien Grand
            • Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development