Lucene - Core
  1. Lucene - Core
  2. LUCENE-295

[PATCH] MultiSearcher problems with Similarity.docFreq()

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Operating System: other
      Platform: All

      Description

      When MultiSearcher invokes its subsearchers, it is the subsearchers' docFreq()
      that is accessed by Similarity.docFreq(). This causes idf's to be computed
      local to each index rather than globally, which causes ranking across multiple
      indices to not be equivalent to ranking across the entire global collection.

      The attached files (if I can figure out how to attach them) provide a potential
      partial solution for this. They properly fix a simple test case, RankingTest,
      that was provided by Daniel Naber.

      The changes are:
      1. Searcher: Add topmostSearcher() field with getter and setter to record
      the outermost Searcher. Default to this.
      2. MultiSearcher: Pass down the topmostSearcher when creating the subsearchers.
      3. IndexSearcher: Call Query.weight() everywhere with the topmostSearcher
      instead of this.
      4. Query: Provide a default implementation of Query.combine() so that
      MultiSearcher works with all queries.

      Problems or possible problems I see:
      1. This does not address the same issue with RemoteSearchable.
      RemoteSearchable is not a Searcher, nor can it be due to lack of multiple
      inheritance in Java, but Query.weight() requires a Searcher. Perhaps
      Query.weight() should be changed to take a Searchable, but this requires
      changing many places and I suspect would break apps.
      2. There may be other places that topmostSearcher should be used instead of this.
      3. The default implementation for Query.combine() is a guess on my part - it
      works for TermQuery. It's fragile in that the default implementation will hide
      bugs caused by queries that inadvertently omit a more precise Query.combine()
      method.
      4. The prior comment on Query.combine() indicates that whoever wrote it was
      fully aware of this problem and so probably had another usage in mind, so the
      whole issue may just be Daniel's usage in the test case. It's not apparent to
      me, so I probably don't understand something.

      1. ASF.LICENSE.NOT.GRANTED--multisearcher.diff
        14 kB
        Wolf Siberski
      2. ASF.LICENSE.NOT.GRANTED--combine-fix2.diff
        13 kB
        Wolf Siberski
      3. ASF.LICENSE.NOT.GRANTED--combine-fix.patch
        5 kB
        Chuck Williams
      4. ASF.LICENSE.NOT.GRANTED--multisearcher-deprecation.diff
        19 kB
        Wolf Siberski
      5. ASF.LICENSE.NOT.GRANTED--multisearcher-deprecation.diff
        20 kB
        Wolf Siberski
      6. ASF.LICENSE.NOT.GRANTED--multisearcher-2005-04-19.diff
        43 kB
        Otis Gospodnetic
      7. ASF.LICENSE.NOT.GRANTED--multisearcher-2005-02-22c.diff
        48 kB
        Wolf Siberski
      8. ASF.LICENSE.NOT.GRANTED--multisearcher-2005-02-18b.diff
        47 kB
        Wolf Siberski
      9. ASF.LICENSE.NOT.GRANTED--multisearcher.diff
        20 kB
        Wolf Siberski
      10. ASF.LICENSE.NOT.GRANTED--multisearcher2.diff
        13 kB
        Wolf Siberski
      11. ASF.LICENSE.NOT.GRANTED--multisearcher.diff
        2 kB
        Wolf Siberski
      12. ASF.LICENSE.NOT.GRANTED--patch.diff
        4 kB
        Daniel Naber
      13. ASF.LICENSE.NOT.GRANTED--MultiSearcherPatch.zip
        7 kB
        Chuck Williams

        Activity

        Hide
        Chuck Williams added a comment -

        Created an attachment (id=13175)
        Modified files for MultiSearcher / Similarity incompatibility

        Show
        Chuck Williams added a comment - Created an attachment (id=13175) Modified files for MultiSearcher / Similarity incompatibility
        Hide
        Daniel Naber added a comment -

        Created an attachment (id=13194)
        Chuck's changes as a diff

        Show
        Daniel Naber added a comment - Created an attachment (id=13194) Chuck's changes as a diff
        Hide
        Daniel Naber added a comment -

        Chuck, thanks for the files. I cannot judge if that's the correct/best way to
        fix the problem. I've attached a diff built from your new files so that
        someone will hopefully review your changes.

        Show
        Daniel Naber added a comment - Chuck, thanks for the files. I cannot judge if that's the correct/best way to fix the problem. I've attached a diff built from your new files so that someone will hopefully review your changes.
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=13413)
        Patch to MultiSearcher.java to correct idf calculation

        Show
        Wolf Siberski added a comment - Created an attachment (id=13413) Patch to MultiSearcher.java to correct idf calculation
        Hide
        Wolf Siberski added a comment -

        Here is another patch to correct the idf calculation. This patch currently is
        kind of proof of concept (fixes Daniel Nabers test case), but not ready for
        commitment, as it breaks ParallelMultiSearcher.java and several test cases.

        Instead of Chucks Williams patch, this one relies on a specialized Similarity
        object which is passed to the searchers. This causes less changes (only
        MultiSearcher is affected), and IMHO fits better into Lucene's design, because
        Similarity is responsible for idf calculation anyway, but the Searchers aren't.

        Problems:

        • The Similarity object can't be passed to RemoteSearchables, so the patch works
          only for Searchers.
        • MultiSearcher.search() methods aren't thread-safe anymore. The only way to
          provide the Searches with the new Similarity object is to use setSimilarity().
          This causes all searches to use it, not only the search which is currently
          running.

        A better solution would be to add the Similarity object to the Query. While ther
        e is already a similarity attribute in the Query, there is no Method to set it,
        and also interaction with subqueries is unclear. Such a change would therefore
        require a major redesign of similarity handling. If we make Similarity
        serializable, then we could pass the object to RemoteSearchers too, as part of
        the query. If I find time, I'll submit a complete patch in the future.

        Show
        Wolf Siberski added a comment - Here is another patch to correct the idf calculation. This patch currently is kind of proof of concept (fixes Daniel Nabers test case), but not ready for commitment, as it breaks ParallelMultiSearcher.java and several test cases. Instead of Chucks Williams patch, this one relies on a specialized Similarity object which is passed to the searchers. This causes less changes (only MultiSearcher is affected), and IMHO fits better into Lucene's design, because Similarity is responsible for idf calculation anyway, but the Searchers aren't. Problems: The Similarity object can't be passed to RemoteSearchables, so the patch works only for Searchers. MultiSearcher.search() methods aren't thread-safe anymore. The only way to provide the Searches with the new Similarity object is to use setSimilarity(). This causes all searches to use it, not only the search which is currently running. A better solution would be to add the Similarity object to the Query. While ther e is already a similarity attribute in the Query, there is no Method to set it, and also interaction with subqueries is unclear. Such a change would therefore require a major redesign of similarity handling. If we make Similarity serializable, then we could pass the object to RemoteSearchers too, as part of the query. If I find time, I'll submit a complete patch in the future.
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=13464)
        Complete patch to allow specification the Similarity on a query-by-query case

        This is a followup on my patch/comment from 2004-11-12. The patch attached now
        is complete, i.e. everything compiles and no test case is broken by it.
        The main idea is that a setSimilarity() method and a similarity attributed is
        added to Query. If the similarity is not set, the query uses the Searcher's
        similarity as before. However, if one sets the Similarity on the query, this
        one takes precedence.
        To solve the MultiSearcher issue, I have provided two different Similarities:

        • MultiSimilarity delegates the docFreq() and maxDoc() calls to the
          MultiSearcher, thus retrieving the sum over all registered searchers.
          This Similarity always 'gets it right', but obviously doesn't work with
          RemoteSearchables.
        • DfMapSimilarity analyses a query and caches all necessary docFreq values.
          This Similarity is Serializable and therefore works with RemoteSearchables,
          too. However, it is not able to handle queries where the term set is not known
          beforehand, e.g. wildcard queries.

        Both problems mentioned in my previous comment (thread-safety and remote
        searcher compatibility) are solved by this patch. All test cases work
        unchanged with the exception of one test case which had been tweaked
        previously due to the incorrect MultiSearcher and now works as expected
        (TestSort.testNormalizedScores()).

        Problems:

        • DfMapSimilarity.collectDfs() contains a lot of ugly casts to Query
          subclasses.
          This could be avoided by adding another abstract method to Query, but it is
          unclear if this is really the better solution.
        • In this patch the choice for DfMapSimilarity is hard-coded into
          MultiSearcher. This should be made configurable.
        Show
        Wolf Siberski added a comment - Created an attachment (id=13464) Complete patch to allow specification the Similarity on a query-by-query case This is a followup on my patch/comment from 2004-11-12. The patch attached now is complete, i.e. everything compiles and no test case is broken by it. The main idea is that a setSimilarity() method and a similarity attributed is added to Query. If the similarity is not set, the query uses the Searcher's similarity as before. However, if one sets the Similarity on the query, this one takes precedence. To solve the MultiSearcher issue, I have provided two different Similarities: MultiSimilarity delegates the docFreq() and maxDoc() calls to the MultiSearcher, thus retrieving the sum over all registered searchers. This Similarity always 'gets it right', but obviously doesn't work with RemoteSearchables. DfMapSimilarity analyses a query and caches all necessary docFreq values. This Similarity is Serializable and therefore works with RemoteSearchables, too. However, it is not able to handle queries where the term set is not known beforehand, e.g. wildcard queries. Both problems mentioned in my previous comment (thread-safety and remote searcher compatibility) are solved by this patch. All test cases work unchanged with the exception of one test case which had been tweaked previously due to the incorrect MultiSearcher and now works as expected (TestSort.testNormalizedScores()). Problems: DfMapSimilarity.collectDfs() contains a lot of ugly casts to Query subclasses. This could be avoided by adding another abstract method to Query, but it is unclear if this is really the better solution. In this patch the choice for DfMapSimilarity is hard-coded into MultiSearcher. This should be made configurable.
        Hide
        Daniel Naber added a comment -
            • Bug 32053 has been marked as a duplicate of this bug. ***
        Show
        Daniel Naber added a comment - Bug 32053 has been marked as a duplicate of this bug. ***
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=14242)
        New patch based on the discussion on lucene-dev (intermediate version)

        According to my understanding of our dev-lucene discussion Multisearcher should
        evaluate queries in the following way:
        1. rewrite queries using the Multisearch Similarity
        2. extract necessary terms
        3. collect idfs for these terms from the Searchables
        4. adjust query boosts according to collected idfs
        5. distribute re-written and re-boosted query to Searchables
        6. merge results

        This patch implements all steps except 4, where it uses a modified Similarity
        class instead. Thus it is fully functional, but not yet ready for commit.

        Show
        Wolf Siberski added a comment - Created an attachment (id=14242) New patch based on the discussion on lucene-dev (intermediate version) According to my understanding of our dev-lucene discussion Multisearcher should evaluate queries in the following way: 1. rewrite queries using the Multisearch Similarity 2. extract necessary terms 3. collect idfs for these terms from the Searchables 4. adjust query boosts according to collected idfs 5. distribute re-written and re-boosted query to Searchables 6. merge results This patch implements all steps except 4, where it uses a modified Similarity class instead. Thus it is fully functional, but not yet ready for commit.
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=14312)
        New patch based on the discussion on lucene-dev (complete)

        This is a complete version, from my point of view ready for prime time.

        Now a query can be 'frozen'. A frozen query always returns the most
        recently created Weight when calling Query.weight().

        MultiSearcher query processing is now done in the following steps:
        1. rewrite query
        2. extract necessary terms
        3. collect dfs for these terms from the Searchables
        4. create query weights using aggregate dfs and freeze query.
        5. distribute weighted and frozen query to Searchables
        6. merge results

        Show
        Wolf Siberski added a comment - Created an attachment (id=14312) New patch based on the discussion on lucene-dev (complete) This is a complete version, from my point of view ready for prime time. Now a query can be 'frozen'. A frozen query always returns the most recently created Weight when calling Query.weight(). MultiSearcher query processing is now done in the following steps: 1. rewrite query 2. extract necessary terms 3. collect dfs for these terms from the Searchables 4. create query weights using aggregate dfs and freeze query . 5. distribute weighted and frozen query to Searchables 6. merge results
        Hide
        cutting@apache.org added a comment -

        This looks good. Thanks!

        A few comments:

        Orignally there was no Weight in Lucene, only Query and Scorer. Weight was
        added in order to make it so that searching did not modify a Query, so that a
        Query instance could be reused. Searcher-dependent state of the query is meant
        to reside in the Weight. IndexReader dependent state resides in the Scorer.
        Your "freezing" a query violates this. Can't we create the weight once in
        Searcher.search?

        CachedDfSource does not need to be public does it?

        We need to think about back-compatibliity. Folks have implementations of Query,
        Weight, Similarity and Scorer. So, when a public API changes we need to
        deprecate, not remove, old methods, and try hard to make the old version still
        work. So, for example, we need to figure out how to handle the case where folks
        have implemented the old Similarity.idf() methods.

        You no longer call Query.getSimilarity(Searcher). That method permits queries
        to override the Searcher's Similarity implementation. Is there a reason you do
        this? We should be computing DFs once for the whole query tree, but it should
        still be possible to compute, e.g., IDFs independently per node, no?

        I also wonder if, instead of adding DocFreqSource we could instead still use the
        Searcher. MultiSearcher could keep an LRU cache of total doc freqs, implemented
        with LinkedHashMap, for the last few thousand search terms. That would be a far
        less invasive change, and hence less likely to break folks. Or am I missing
        something?

        Sorry if I seem picky, but this is core stuff in Lucene and affects a lot of people.

        Show
        cutting@apache.org added a comment - This looks good. Thanks! A few comments: Orignally there was no Weight in Lucene, only Query and Scorer. Weight was added in order to make it so that searching did not modify a Query, so that a Query instance could be reused. Searcher-dependent state of the query is meant to reside in the Weight. IndexReader dependent state resides in the Scorer. Your "freezing" a query violates this. Can't we create the weight once in Searcher.search? CachedDfSource does not need to be public does it? We need to think about back-compatibliity. Folks have implementations of Query, Weight, Similarity and Scorer. So, when a public API changes we need to deprecate, not remove, old methods, and try hard to make the old version still work. So, for example, we need to figure out how to handle the case where folks have implemented the old Similarity.idf() methods. You no longer call Query.getSimilarity(Searcher). That method permits queries to override the Searcher's Similarity implementation. Is there a reason you do this? We should be computing DFs once for the whole query tree, but it should still be possible to compute, e.g., IDFs independently per node, no? I also wonder if, instead of adding DocFreqSource we could instead still use the Searcher. MultiSearcher could keep an LRU cache of total doc freqs, implemented with LinkedHashMap, for the last few thousand search terms. That would be a far less invasive change, and hence less likely to break folks. Or am I missing something? Sorry if I seem picky, but this is core stuff in Lucene and affects a lot of people.
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=14344)
        Next version - sends weights instead of queries

        Thanks for the valuable feedback which has resulted in this new version. Now
        the query freezing approach is avoided by sending weights to the searchables
        instead of queries. Thus queries are still resusable, but the Searchable
        interface had to be extended. Also, previous API and behavior modifications
        have been reverted as far as possible.

        Show
        Wolf Siberski added a comment - Created an attachment (id=14344) Next version - sends weights instead of queries Thanks for the valuable feedback which has resulted in this new version. Now the query freezing approach is avoided by sending weights to the searchables instead of queries. Thus queries are still resusable, but the Searchable interface had to be extended. Also, previous API and behavior modifications have been reverted as far as possible.
        Hide
        cutting@apache.org added a comment -

        This looks great to me!

        +1

        Thanks again for patiently working through this rather extensive change.

        Show
        cutting@apache.org added a comment - This looks great to me! +1 Thanks again for patiently working through this rather extensive change.
        Hide
        Otis Gospodnetic added a comment -

        I applied this patch (2005-02-22 version) locally, but it doesn't compile as is.
        It looks like I need to add "throws IOException" to
        BooleanQuery.BooleanWeight2.createWeight(Searcher) methods, and also to
        MultiPhraseQuery.createWeight(Searcher) method.

        It looks like I also need to modify TermQuery.TermWeight and add a searcher
        declaration (private Searcher searcher, plus the searcher assignment to the
        constructor (this.searcher = searcher).

        After these 2-3 changes Lucene compile. Are my changes OK?

        Unfortunately, unit tests fail in a few places, and it looks like all failures
        have to do with RemoteSearchable. Re-reading the comments in this bug entry, it
        looks like this was a known issue at one point, but Wolf noted the following on
        2004-11-15:

        "Both problems mentioned in my previous comment (thread-safety and remote
        searcher compatibility) are solved by this patch. All test cases work
        unchanged with the exception of one test case which had been tweaked
        previously due to the incorrect MultiSearcher and now works as expected
        (TestSort.testNormalizedScores())."

        This sounds like tests involving RemoteSearchable should pass now.

        Show
        Otis Gospodnetic added a comment - I applied this patch (2005-02-22 version) locally, but it doesn't compile as is. It looks like I need to add "throws IOException" to BooleanQuery.BooleanWeight2.createWeight(Searcher) methods, and also to MultiPhraseQuery.createWeight(Searcher) method. It looks like I also need to modify TermQuery.TermWeight and add a searcher declaration (private Searcher searcher , plus the searcher assignment to the constructor (this.searcher = searcher). After these 2-3 changes Lucene compile. Are my changes OK? Unfortunately, unit tests fail in a few places, and it looks like all failures have to do with RemoteSearchable. Re-reading the comments in this bug entry, it looks like this was a known issue at one point, but Wolf noted the following on 2004-11-15: "Both problems mentioned in my previous comment (thread-safety and remote searcher compatibility) are solved by this patch. All test cases work unchanged with the exception of one test case which had been tweaked previously due to the incorrect MultiSearcher and now works as expected (TestSort.testNormalizedScores())." This sounds like tests involving RemoteSearchable should pass now.
        Hide
        Otis Gospodnetic added a comment -

        Changing MultiSearcher to implement Serializable makes all unit tests pass again.

        I'll wait a bit more before committing this in order to give people time to comment.

        Show
        Otis Gospodnetic added a comment - Changing MultiSearcher to implement Serializable makes all unit tests pass again. I'll wait a bit more before committing this in order to give people time to comment.
        Hide
        Paul Elschot added a comment -

        (In reply to comment #13)
        > I applied this patch (2005-02-22 version) locally, but it doesn't compile as
        is.
        > It looks like I need to add "throws IOException" to
        > BooleanQuery.BooleanWeight2.createWeight(Searcher) methods, and also to
        > MultiPhraseQuery.createWeight(Searcher) method.

        That should be no problem for BooleanWeight2.createWeight(Searcher).

        > It looks like I also need to modify TermQuery.TermWeight and add a searcher
        > declaration (private Searcher searcher, plus the searcher assignment to the
        > constructor (this.searcher = searcher).
        >
        > After these 2-3 changes Lucene compile. Are my changes OK?

        I think so.

        Regards,
        Paul Elschot

        Show
        Paul Elschot added a comment - (In reply to comment #13) > I applied this patch (2005-02-22 version) locally, but it doesn't compile as is. > It looks like I need to add "throws IOException" to > BooleanQuery.BooleanWeight2.createWeight(Searcher) methods, and also to > MultiPhraseQuery.createWeight(Searcher) method. That should be no problem for BooleanWeight2.createWeight(Searcher). > It looks like I also need to modify TermQuery.TermWeight and add a searcher > declaration (private Searcher searcher , plus the searcher assignment to the > constructor (this.searcher = searcher). > > After these 2-3 changes Lucene compile. Are my changes OK? I think so. Regards, Paul Elschot
        Hide
        Chuck Williams added a comment -

        I made these comments last night, but somehow they aren't here. Sorry if a
        duplicate pops up in the wrong place somewhere...

        I don't think all of the changes are correct and feel we need Wolf to comment on
        this as his message states that he ran the unit tests successfully on his code
        as submitted. Specifically, it looks to me like:
        1. The additions of throws IOException to Weights are correct, and due to new
        Weights added in parallel to Wolf's work.
        2. I do not believe adding the searcher field back to TermWeight is correct.
        Wolf explicitly pulled this field out and replaced all references to it with
        references to similarity or idf. What caused the field to be added back?
        3. Making MultiSearcher implement Serializable seems strange – I don't
        understand why in the protocol this would be requried. It does not seem right.

        Chuck

        Show
        Chuck Williams added a comment - I made these comments last night, but somehow they aren't here. Sorry if a duplicate pops up in the wrong place somewhere... I don't think all of the changes are correct and feel we need Wolf to comment on this as his message states that he ran the unit tests successfully on his code as submitted. Specifically, it looks to me like: 1. The additions of throws IOException to Weights are correct, and due to new Weights added in parallel to Wolf's work. 2. I do not believe adding the searcher field back to TermWeight is correct. Wolf explicitly pulled this field out and replaced all references to it with references to similarity or idf. What caused the field to be added back? 3. Making MultiSearcher implement Serializable seems strange – I don't understand why in the protocol this would be requried. It does not seem right. Chuck
        Hide
        Chuck Williams added a comment -

        Here is what I think happened and, if so, the correct fix:
        1. MultiPhraseQuery and the new BooleanQuery were both added in parallel to
        Wolf's work. We now have a merge conflict.
        2. The addition of the throws IOException to these two classes is now
        required for Weights, and so the change is correct.
        3. The new protocol for Weight's is to NOT have a searcher field. Instead,
        they should keep a Similarity field. Just like the other Weight's that Wolf's
        patch addresses, these two classes only use the searcher field to access the
        Similarity and/or the idf. MultiPhraseQuery, which needs the idf, already has
        an idf field. So, identical to Wolf's other changes it is straightforward to
        modify these two classes to eliminate the searcher field.
        4. Eliminating the searcher field will elimninate the need for MultiSearcher
        to be Serializable, therefore retracting that change.

        I think that's right, and that the best fix is too conform the new Weights to
        the new protocol for MultiSearcher compatibility. I've also sent a message
        directly to Wolf asking him to review this.

        Chuck

        Show
        Chuck Williams added a comment - Here is what I think happened and, if so, the correct fix: 1. MultiPhraseQuery and the new BooleanQuery were both added in parallel to Wolf's work. We now have a merge conflict. 2. The addition of the throws IOException to these two classes is now required for Weights, and so the change is correct. 3. The new protocol for Weight's is to NOT have a searcher field. Instead, they should keep a Similarity field. Just like the other Weight's that Wolf's patch addresses, these two classes only use the searcher field to access the Similarity and/or the idf. MultiPhraseQuery, which needs the idf, already has an idf field. So, identical to Wolf's other changes it is straightforward to modify these two classes to eliminate the searcher field. 4. Eliminating the searcher field will elimninate the need for MultiSearcher to be Serializable, therefore retracting that change. I think that's right, and that the best fix is too conform the new Weights to the new protocol for MultiSearcher compatibility. I've also sent a message directly to Wolf asking him to review this. Chuck
        Hide
        Chuck Williams added a comment -

        One more note on this. I think conforming to the new Weight protocol, i.e.
        eliminating the searcher field from BooleanWeight, BooleanWeight2, and
        MultiPhraseWeight, is quite important. This is because the Weight is now sent
        to RemoteSearchables instead of the Query. If Weights have a searcher field,
        then this could be quite expensive.

        Chuck

        Show
        Chuck Williams added a comment - One more note on this. I think conforming to the new Weight protocol, i.e. eliminating the searcher field from BooleanWeight, BooleanWeight2, and MultiPhraseWeight, is quite important. This is because the Weight is now sent to RemoteSearchables instead of the Query. If Weights have a searcher field, then this could be quite expensive. Chuck
        Hide
        Paul Elschot added a comment -

        (In reply to comment #17)
        > Here is what I think happened and, if so, the correct fix:
        > 1. MultiPhraseQuery and the new BooleanQuery were both added in parallel
        to
        > Wolf's work. We now have a merge conflict.
        > 2. The addition of the throws IOException to these two classes is now
        > required for Weights, and so the change is correct.

        It could well be that there is a conflict with the new BooleanQuery and
        BooleanScorer. Given the backward compatibility goal of this new code, I'd be
        happy to resolve that possible conflict.

        Over the last months, I have kept all the changes to BooleanQuery/BooleanScorer
        in a local working copy, and I had only one or two conflicts. These were
        very easy to resolve. Bugzilla works well in this respect: all changes have
        some comments associated with them, and, from what I have read so far,
        there are no conflicting intentions for the functionality of this code.
        So, please go ahead as you see fit for the MultiSearcher. I think you
        can safely ignore evt. conflicts with new code for BooleanQuery/BooleanScorer.

        Regards,
        Paul Elschot

        Show
        Paul Elschot added a comment - (In reply to comment #17) > Here is what I think happened and, if so, the correct fix: > 1. MultiPhraseQuery and the new BooleanQuery were both added in parallel to > Wolf's work. We now have a merge conflict. > 2. The addition of the throws IOException to these two classes is now > required for Weights, and so the change is correct. It could well be that there is a conflict with the new BooleanQuery and BooleanScorer. Given the backward compatibility goal of this new code, I'd be happy to resolve that possible conflict. Over the last months, I have kept all the changes to BooleanQuery/BooleanScorer in a local working copy, and I had only one or two conflicts. These were very easy to resolve. Bugzilla works well in this respect: all changes have some comments associated with them, and, from what I have read so far, there are no conflicting intentions for the functionality of this code. So, please go ahead as you see fit for the MultiSearcher. I think you can safely ignore evt. conflicts with new code for BooleanQuery/BooleanScorer. Regards, Paul Elschot
        Hide
        Wolf Siberski added a comment -

        Sorry for the late reply; I'm currently traveling and had no Internet access.
        Chucks analysis is completely correct. Serializing a (Multi-)Searcher doesn't
        make sense, instead the Similarity has to be passed with the weight.

        What I don't understand is why the patch failed. I checked the diff file again,
        and the necessary changes appear to be in it. For example, line 89 of the diff
        contains the addition of 'throws IOException' to BooleanWeight2.createWeight(),
        and the corresponding change for MultiPhraseQueryWeight is on line 346. The
        changes required for TermWeight are at starting at line 1187. Could you please
        check if everything went ok when applying the patch, by comparing manually to
        the diff file when errors occur somewhere?

        Show
        Wolf Siberski added a comment - Sorry for the late reply; I'm currently traveling and had no Internet access. Chucks analysis is completely correct. Serializing a (Multi-)Searcher doesn't make sense, instead the Similarity has to be passed with the weight. What I don't understand is why the patch failed. I checked the diff file again, and the necessary changes appear to be in it. For example, line 89 of the diff contains the addition of 'throws IOException' to BooleanWeight2.createWeight(), and the corresponding change for MultiPhraseQueryWeight is on line 346. The changes required for TermWeight are at starting at line 1187. Could you please check if everything went ok when applying the patch, by comparing manually to the diff file when errors occur somewhere?
        Hide
        Otis Gospodnetic added a comment -

        Created an attachment (id=14767)
        local diffs that I'll apply

        This is what the final diff is like. I'll apply it shortly.

        Show
        Otis Gospodnetic added a comment - Created an attachment (id=14767) local diffs that I'll apply This is what the final diff is like. I'll apply it shortly.
        Hide
        Otis Gospodnetic added a comment -

        Committed, unit tests pass.

        Show
        Otis Gospodnetic added a comment - Committed, unit tests pass.
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=14784)
        Additional patch for deprecation issue

        This patch removes all references to deprecated methods in Searchable,
        and fixes some other minor issues.

        Show
        Wolf Siberski added a comment - Created an attachment (id=14784) Additional patch for deprecation issue This patch removes all references to deprecated methods in Searchable, and fixes some other minor issues.
        Hide
        cutting@apache.org added a comment -

        I am now confused by the changes in Searcher.java. Why is sometimes
        query.weight(Searcher) called and other times query.createWeight()? The latter
        is meant only to only construct a Weight, and the former to construct and
        initialize it. Shouldn't it always be initialized?

        Also, it was difficult to apply your patch. It uses CRLF instead of LF, which
        confuses patch on Linux.

        Show
        cutting@apache.org added a comment - I am now confused by the changes in Searcher.java. Why is sometimes query.weight(Searcher) called and other times query.createWeight()? The latter is meant only to only construct a Weight, and the former to construct and initialize it. Shouldn't it always be initialized? Also, it was difficult to apply your patch. It uses CRLF instead of LF, which confuses patch on Linux.
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=14841)
        Additional patch for deprecation issue - corrected

        This was just an oversight. I've replaced the remaining calls to query.weight()
        in Searcher with Searcher.createWeight() and corrected that method so that it
        calls query.weight() now.

        Show
        Wolf Siberski added a comment - Created an attachment (id=14841) Additional patch for deprecation issue - corrected This was just an oversight. I've replaced the remaining calls to query.weight() in Searcher with Searcher.createWeight() and corrected that method so that it calls query.weight() now.
        Hide
        cutting@apache.org added a comment -

        I have applied the deprecation patch.

        The solution to my patch difficulties was to use 'patch -l -F 4'. This gets
        around the end-of-line issues.

        Thanks again, Wolf!

        Show
        cutting@apache.org added a comment - I have applied the deprecation patch. The solution to my patch difficulties was to use 'patch -l -F 4'. This gets around the end-of-line issues. Thanks again, Wolf!
        Hide
        Chuck Williams added a comment -

        Created an attachment (id=14846)
        [PATCH] Fix to Query.combine() method and all specializations

        This fixes the bugs in Query.combine() that were uncovered by the failing test
        in the Highlighter. Only Query.combine() remains – all overrides in
        BooleanQuery, RangeQuery, MultiTermQuery and PrefixQuery are deleted. I
        believe this fix is correct, robust realtive to possible user Query
        implementations, and generates optimal queries for at least the cases that are
        built-in to Lucene (query rewriting of MultTermQuery's and RangeQuery's). This
        is more robust relative to possible user Query implementations and covers more
        optimizations cases than the version I sent via email last night. With this
        patch, all Lucene and Highlighter tests pass (with the exception of the buggy
        TestSort.testNormalizedScores() which should be fixed by Wolf's patch).

        Show
        Chuck Williams added a comment - Created an attachment (id=14846) [PATCH] Fix to Query.combine() method and all specializations This fixes the bugs in Query.combine() that were uncovered by the failing test in the Highlighter. Only Query.combine() remains – all overrides in BooleanQuery, RangeQuery, MultiTermQuery and PrefixQuery are deleted. I believe this fix is correct, robust realtive to possible user Query implementations, and generates optimal queries for at least the cases that are built-in to Lucene (query rewriting of MultTermQuery's and RangeQuery's). This is more robust relative to possible user Query implementations and covers more optimizations cases than the version I sent via email last night. With this patch, all Lucene and Highlighter tests pass (with the exception of the buggy TestSort.testNormalizedScores() which should be fixed by Wolf's patch).
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=14854)
        Slightly extended Fix to Query.combine()

        This is an extension/modification of Chucks patch.
        Besides his changes, it contains more test cases (in TestMultiSearcherRanking)
        and minor additional modifications to make these tests work:

        • if the resulting query consists of only one clause, return that clause
          directly instead of wrapping it into a BooleanQuery.
        • BooleanQuery.equals doesn't take clause order into account anymore
          I also have rewritten the loop logic of Query.combine to use a flag instead
          of a labeled break, but this is more a matter of taste.
        Show
        Wolf Siberski added a comment - Created an attachment (id=14854) Slightly extended Fix to Query.combine() This is an extension/modification of Chucks patch. Besides his changes, it contains more test cases (in TestMultiSearcherRanking) and minor additional modifications to make these tests work: if the resulting query consists of only one clause, return that clause directly instead of wrapping it into a BooleanQuery. BooleanQuery.equals doesn't take clause order into account anymore I also have rewritten the loop logic of Query.combine to use a flag instead of a labeled break, but this is more a matter of taste.
        Hide
        Chuck Williams added a comment -

        Wolf's revisions to my changes to Query.combine() look fine. The single-query
        optimization is good – my oversight to have not included it originally. I
        don't believe either of the other two changes is necessary, but they are correct:
        1. Using a flag instead of the labelled loop is a matter of style as Wolf
        says, and it's a little less efficent (the biggest effect could be remedied by
        one more if (splittable) to avoid unnecessarily copying the clauses of a
        BooleanQuery where coord is not disabled).
        2. Changing BooleanQuery equality to be independent of clause order is
        semantically correct, although again it is a little less efficient. It's only
        purpose is to stop a false-negative in the new tests.

        Regarding additional test cases, it would be helpful to add the cases I was
        concerned about, which are situations where a query can rewrite into different
        kinds of fundamental queries depending on the reader. I believe the only case
        where this occurs with the built-in queries is with MultiTermQuery's and
        RangeQuery's (where the rewrite depends on how many query clauses are generated
        by each reader), and we have covered those cases. The coord testing in
        Query.combine() is designed to handle the case where some query rewrites into a
        different kind of BooleanQuery (e.g., an AND), in some readers and not others.
        Nothing tests this at present. A single-term BooleanQuery OR could rewrite into
        a BooleanQuery AND, but this would be independent of reader.

        Many additional optimizations could be added. It seems redundant to have
        optimizations here and in the rewrite mechanism. Since we are down to just
        Query.combine(), only called from one place, I think a better fix is to change
        MultiSearcher to pass the reader as well. Then Query.combine() could construct
        the straightforward BooleanQuery and rewrite it. All the optimizations would
        then go into a single place, the rewrite methods. Wolf, what do you think of
        that approach?

        Show
        Chuck Williams added a comment - Wolf's revisions to my changes to Query.combine() look fine. The single-query optimization is good – my oversight to have not included it originally. I don't believe either of the other two changes is necessary, but they are correct: 1. Using a flag instead of the labelled loop is a matter of style as Wolf says, and it's a little less efficent (the biggest effect could be remedied by one more if (splittable) to avoid unnecessarily copying the clauses of a BooleanQuery where coord is not disabled). 2. Changing BooleanQuery equality to be independent of clause order is semantically correct, although again it is a little less efficient. It's only purpose is to stop a false-negative in the new tests. Regarding additional test cases, it would be helpful to add the cases I was concerned about, which are situations where a query can rewrite into different kinds of fundamental queries depending on the reader. I believe the only case where this occurs with the built-in queries is with MultiTermQuery's and RangeQuery's (where the rewrite depends on how many query clauses are generated by each reader), and we have covered those cases. The coord testing in Query.combine() is designed to handle the case where some query rewrites into a different kind of BooleanQuery (e.g., an AND), in some readers and not others. Nothing tests this at present. A single-term BooleanQuery OR could rewrite into a BooleanQuery AND, but this would be independent of reader. Many additional optimizations could be added. It seems redundant to have optimizations here and in the rewrite mechanism. Since we are down to just Query.combine(), only called from one place, I think a better fix is to change MultiSearcher to pass the reader as well. Then Query.combine() could construct the straightforward BooleanQuery and rewrite it. All the optimizations would then go into a single place, the rewrite methods. Wolf, what do you think of that approach?
        Hide
        Wolf Siberski added a comment -

        Created an attachment (id=15481)
        Updated fix to Query.combine()

        This is a slightly updated version of my previous patch, taking Chucks comments
        into account and based on the current HEAD.

        Show
        Wolf Siberski added a comment - Created an attachment (id=15481) Updated fix to Query.combine() This is a slightly updated version of my previous patch, taking Chucks comments into account and based on the current HEAD.
        Hide
        Wolf Siberski added a comment -
            • Bug 35241 has been marked as a duplicate of this bug. ***
        Show
        Wolf Siberski added a comment - Bug 35241 has been marked as a duplicate of this bug. ***
        Hide
        Daniel Naber added a comment -

        There might be one corner case that your change in equals doesn't get right
        (not tested): a query "a b a" (i.e. one clause occuring twice – doesn't make
        much sense, but has an influence on the result order) would equal "a b b",
        wouldn't it? Is that a problem?

        Show
        Daniel Naber added a comment - There might be one corner case that your change in equals doesn't get right (not tested): a query "a b a" (i.e. one clause occuring twice – doesn't make much sense, but has an influence on the result order) would equal "a b b", wouldn't it? Is that a problem?
        Hide
        Wolf Siberski added a comment -

        I've started a separate thread on BooleanQuery semantics on the dev mailing
        list. You could still apply the patch and omit the change to
        BooleanQuery.equals(). This just leads to warning outputs in
        TestMultiSearcherRanking.checkQuery(), and if you want to avoid these, simply
        delete the last 10 lines of TestMultiSearcherRanking.checkQuery(). They contain
        an additional, not really necessary test.

        Show
        Wolf Siberski added a comment - I've started a separate thread on BooleanQuery semantics on the dev mailing list. You could still apply the patch and omit the change to BooleanQuery.equals(). This just leads to warning outputs in TestMultiSearcherRanking.checkQuery(), and if you want to avoid these, simply delete the last 10 lines of TestMultiSearcherRanking.checkQuery(). They contain an additional, not really necessary test.
        Hide
        Daniel Naber added a comment -

        Thanks, I've committed your patch, leaving out the change to equals() and not
        removing mergeBooleanQueries() as that is a public method which someone might
        be using. Also, could you please check if the test case is correct now? I
        couldn't apply that part of your patch cleanly and something might have been
        broken.

        Show
        Daniel Naber added a comment - Thanks, I've committed your patch, leaving out the change to equals() and not removing mergeBooleanQueries() as that is a public method which someone might be using. Also, could you please check if the test case is correct now? I couldn't apply that part of your patch cleanly and something might have been broken.
        Hide
        Wolf Siberski added a comment -

        Thanks for applying the patch. Everything is fine from my point of view, except
        one minor issue: You are right that Query.mergeBooleanQueries() can't just be
        removed, as it is public. However, I doubt if it is still useful. Could we mark
        it as deprecated, and suggest a call to super.combine() as replacement?

        Show
        Wolf Siberski added a comment - Thanks for applying the patch. Everything is fine from my point of view, except one minor issue: You are right that Query.mergeBooleanQueries() can't just be removed, as it is public. However, I doubt if it is still useful. Could we mark it as deprecated, and suggest a call to super.combine() as replacement?
        Hide
        Daniel Naber added a comment -

        Should combine() maybe be static? Then deprecating mergeBooleanQueries() seems
        to be the obvious way to go.

        Show
        Daniel Naber added a comment - Should combine() maybe be static? Then deprecating mergeBooleanQueries() seems to be the obvious way to go.
        Hide
        Daniel Naber added a comment -

        Okay, I see that there are errors in the test cases if one makes
        Query.combine() static. But does it have to be public?

        Show
        Daniel Naber added a comment - Okay, I see that there are errors in the test cases if one makes Query.combine() static. But does it have to be public?
        Hide
        Wolf Siberski added a comment -

        Currently, Query.combine() is only called by MultiSearcher and a Highlighter
        test case. If I understand it correctly, the intent of this method was to give
        Query subclasses a hook to modify the combination algorithm. This hook is not
        used by Query subclasses which are part of the Lucene distribution, and I would
        be very astonished if someone used it outside of Lucene. Such an implementation
        would break now anyway, because BooleanQueries are only handled correctly in the
        new Query.combine[] method, but not in mergeBooleanQueries() (which
        reimplementations are supposed to call as helper method).

        IMHO the cleanest solution would be to remove combine() and
        mergeBooleanQueries() from Query and move the query combination logic to
        MultiSearcher. Of course, that would be an incompatible API change, and we can't
        be 100% sure that no one needs the combine() hook. On the other hand, we now
        have useless code in mergeBooleanQueries() and code which is only used by
        MultiSearcher in combine(), and that also isn't a desirable situation.

        Show
        Wolf Siberski added a comment - Currently, Query.combine() is only called by MultiSearcher and a Highlighter test case. If I understand it correctly, the intent of this method was to give Query subclasses a hook to modify the combination algorithm. This hook is not used by Query subclasses which are part of the Lucene distribution, and I would be very astonished if someone used it outside of Lucene. Such an implementation would break now anyway, because BooleanQueries are only handled correctly in the new Query.combine[] method, but not in mergeBooleanQueries() (which reimplementations are supposed to call as helper method). IMHO the cleanest solution would be to remove combine() and mergeBooleanQueries() from Query and move the query combination logic to MultiSearcher. Of course, that would be an incompatible API change, and we can't be 100% sure that no one needs the combine() hook. On the other hand, we now have useless code in mergeBooleanQueries() and code which is only used by MultiSearcher in combine(), and that also isn't a desirable situation.

          People

          • Assignee:
            Lucene Developers
            Reporter:
            Chuck Williams
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development