Lucene - Core
  1. Lucene - Core
  2. LUCENE-5409

ToParentBlockJoinCollector.getTopGroups returns empty Groups

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 4.6
    • Fix Version/s: 4.7, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Ubuntu 12.04

    • Lucene Fields:
      New

      Description

      A bug is observed to cause unstable results returned by the getTopGroups function of class ToParentBlockJoinCollector.

      In the scorer generation stage, the ToParentBlockJoinCollector will automatically rewrite all the associated ToParentBlockJoinQuery (and their subqueries), and save them into its in-memory Look-up table, namely joinQueryID (see enroll() method for detail). Unfortunately, in the getTopGroups method, the new ToParentBlockJoinQuery parameter is not rewritten (at least users are not expected to do so). When the new one is searched in the old lookup table (considering the impact of rewrite() on hashCode()), the lookup will largely fail and eventually end up with a topGroup collection consisting of only empty groups (their hitCounts are guaranteed to be zero).

      An easy fix would be to rewrite the original BlockJoinQuery before invoking getTopGroups method. However, the computational cost of this is not optimal. A better but slightly more complex solution would be to save unrewrited Queries into the lookup table.

        Activity

        Hide
        Michael McCandless added a comment -

        Hi Peng, this is a good catch! Probably we could key on the .origChildQuery since we already carry that one along.

        Do you have a test case showing the issue?

        Show
        Michael McCandless added a comment - Hi Peng, this is a good catch! Probably we could key on the .origChildQuery since we already carry that one along. Do you have a test case showing the issue?
        Hide
        Peng Cheng added a comment -

        Hi Michael, I'll be working on the test case this evening.
        Thanks a lot for the confirmation. I've read your book.

        Show
        Peng Cheng added a comment - Hi Michael, I'll be working on the test case this evening. Thanks a lot for the confirmation. I've read your book.
        Hide
        Peng Cheng added a comment -

        I was trying to work on a test case but have encountered the following problem:

        The bug will only be triggered if a ToParentBlockJoinQuery can be rewritten into another query with a different hashcode (Uwe said this is a common situation). But I've experimented with several simple Query and they all have identical hashcode after the rewriting. The query that failed in my project was a very long CustomScoreQuery (used for feature engineering in text analysis), I wouldn't imagine to put that into the unit test. So can you show me an example of a compound query that doesn't preserve hashcode? I can finish other works.

        Show
        Peng Cheng added a comment - I was trying to work on a test case but have encountered the following problem: The bug will only be triggered if a ToParentBlockJoinQuery can be rewritten into another query with a different hashcode (Uwe said this is a common situation). But I've experimented with several simple Query and they all have identical hashcode after the rewriting. The query that failed in my project was a very long CustomScoreQuery (used for feature engineering in text analysis), I wouldn't imagine to put that into the unit test. So can you show me an example of a compound query that doesn't preserve hashcode? I can finish other works.
        Hide
        Michael McCandless added a comment -

        Hmm, maybe a BooleanQuery containing just a single TermQuery?

        Or, maybe create a custom Query for this test that wraps another query but has its own "unusual" hashCode() impl...

        Show
        Michael McCandless added a comment - Hmm, maybe a BooleanQuery containing just a single TermQuery? Or, maybe create a custom Query for this test that wraps another query but has its own "unusual" hashCode() impl...
        Hide
        Peng Cheng added a comment - - edited

        Still doesn't work, this is my test case, perhaps I still need to implement a Query with custom rewrite and hashcode

        public void testRewriteHash() throws IOException

        { final Directory dir = newDirectory(); final RandomIndexWriter w = new RandomIndexWriter(random(), dir); final List<Document> docs = new ArrayList<Document>(); docs.add(makeJob("java", 2007)); docs.add(makeJob("python", 2010)); docs.add(makeResume("Lisa", "United Kingdom")); w.addDocuments(docs); docs.clear(); docs.add(makeJob("ruby", 2005)); docs.add(makeJob("java", 2006)); docs.add(makeResume("Frank", "United States")); w.addDocuments(docs); IndexReader r = w.getReader(); w.close(); IndexSearcher s = newSearcher(r); Query q = new TermQuery(new Term("skill","rudy")); BooleanQuery qc = new BooleanQuery(); qc.add(q,Occur.SHOULD); Filter parentsFilter = new FixedBitSetCachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("docType", "resume")))); Query qp = new ToParentBlockJoinQuery(qc,parentsFilter,ScoreMode.Max); int hash1 = qp.hashCode(); Query qw = s.rewrite(qp); int hash2 = qp.hashCode(); assertTrue(hash1 != hash2); }
        Show
        Peng Cheng added a comment - - edited Still doesn't work, this is my test case, perhaps I still need to implement a Query with custom rewrite and hashcode public void testRewriteHash() throws IOException { final Directory dir = newDirectory(); final RandomIndexWriter w = new RandomIndexWriter(random(), dir); final List<Document> docs = new ArrayList<Document>(); docs.add(makeJob("java", 2007)); docs.add(makeJob("python", 2010)); docs.add(makeResume("Lisa", "United Kingdom")); w.addDocuments(docs); docs.clear(); docs.add(makeJob("ruby", 2005)); docs.add(makeJob("java", 2006)); docs.add(makeResume("Frank", "United States")); w.addDocuments(docs); IndexReader r = w.getReader(); w.close(); IndexSearcher s = newSearcher(r); Query q = new TermQuery(new Term("skill","rudy")); BooleanQuery qc = new BooleanQuery(); qc.add(q,Occur.SHOULD); Filter parentsFilter = new FixedBitSetCachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("docType", "resume")))); Query qp = new ToParentBlockJoinQuery(qc,parentsFilter,ScoreMode.Max); int hash1 = qp.hashCode(); Query qw = s.rewrite(qp); int hash2 = qp.hashCode(); assertTrue(hash1 != hash2); }
        Hide
        Michael McCandless added a comment -

        Interesting ... because the hashCode for the TermQuery and the BooleanQuery do in fact differ.

        OK I see what's happening: ToParentBJQ.hashCode uses origChildQuery to compute its hashcode (and in .equals); so ... rewrite won't change it, by design.

        Hmm, which then seems like you should not be hitting an issue here? Maybe try to boil down your problem to a small test case?

        Show
        Michael McCandless added a comment - Interesting ... because the hashCode for the TermQuery and the BooleanQuery do in fact differ. OK I see what's happening: ToParentBJQ.hashCode uses origChildQuery to compute its hashcode (and in .equals); so ... rewrite won't change it, by design. Hmm, which then seems like you should not be hitting an issue here? Maybe try to boil down your problem to a small test case?
        Hide
        Peng Cheng added a comment -

        That's a good point. but I also speculate that ToParentBJQ.rewrite() will change origChildQuery:
        Query rewritten = new ToParentBlockJoinQuery(childQuery,
        childRewrite,
        parentsFilter,
        scoreMode);
        if rewrite is executed >twice, the origChildQuery won't be the same.

        Show
        Peng Cheng added a comment - That's a good point. but I also speculate that ToParentBJQ.rewrite() will change origChildQuery: Query rewritten = new ToParentBlockJoinQuery(childQuery, childRewrite, parentsFilter, scoreMode); if rewrite is executed >twice, the origChildQuery won't be the same.
        Hide
        Peng Cheng added a comment -

        Wow I just changed rewritten code to use origChildQuery:
        Query rewritten = new ToParentBlockJoinQuery(origChildQuery,
        childRewrite,
        parentsFilter,
        scoreMode);
        and it passed all unit test, I never thought the fix being that easy. Thank you so much for pointing it out!
        Also, is it possible to close the trivial LUCENE-5398?

        Show
        Peng Cheng added a comment - Wow I just changed rewritten code to use origChildQuery: Query rewritten = new ToParentBlockJoinQuery(origChildQuery, childRewrite, parentsFilter, scoreMode); and it passed all unit test, I never thought the fix being that easy. Thank you so much for pointing it out! Also, is it possible to close the trivial LUCENE-5398 ?
        Hide
        Michael McCandless added a comment -

        Ahh ... two rewrites will store the wrong origChildQuery. Nice catch

        But: do we first have a failing test case here?

        Also, were you rewriting yourself externally in your application? Or is there some path through Lucene that results in more than one rewrite?

        Show
        Michael McCandless added a comment - Ahh ... two rewrites will store the wrong origChildQuery. Nice catch But: do we first have a failing test case here? Also, were you rewriting yourself externally in your application? Or is there some path through Lucene that results in more than one rewrite?
        Hide
        Peng Cheng added a comment -

        Not yet, I'm trying to find a special case.
        I never rewrite myself (so does most of lucene users) but IndexSearcher.rewrite method used in search(...) will rewrite any query recursively until it reaches the final stage.

        Show
        Peng Cheng added a comment - Not yet, I'm trying to find a special case. I never rewrite myself (so does most of lucene users) but IndexSearcher.rewrite method used in search(...) will rewrite any query recursively until it reaches the final stage.
        Hide
        Michael McCandless added a comment -

        I never rewrite myself (so does most of lucene users) but IndexSearcher.rewrite method used in search(...) will rewrite any query recursively until it reaches the final stage.

        Ahh, right.

        But still I'm confused how this would tickle the bug. Clearly the bug is there, and we should be passing origChildQuery as the first arg when we rewrite, but I'd still like a small test case here showing the issue, somehow ...

        Show
        Michael McCandless added a comment - I never rewrite myself (so does most of lucene users) but IndexSearcher.rewrite method used in search(...) will rewrite any query recursively until it reaches the final stage. Ahh, right. But still I'm confused how this would tickle the bug. Clearly the bug is there, and we should be passing origChildQuery as the first arg when we rewrite, but I'd still like a small test case here showing the issue, somehow ...
        Hide
        Peng Cheng added a comment -

        Finally got your test case: it only appears in larger scale, this is really excruciating as I'm not a software architect.

        To run the failed test case, please apply the attached patch or manually copy the unit test function into testBlockJoin.java

        Show
        Peng Cheng added a comment - Finally got your test case: it only appears in larger scale, this is really excruciating as I'm not a software architect. To run the failed test case, please apply the attached patch or manually copy the unit test function into testBlockJoin.java
        Hide
        Peng Cheng added a comment -

        patch file

        Show
        Peng Cheng added a comment - patch file
        Hide
        Michael McCandless added a comment -

        Wow, thanks Peng! I'll have a look.

        Show
        Michael McCandless added a comment - Wow, thanks Peng! I'll have a look.
        Hide
        ASF subversion and git services added a comment -

        Commit 1562124 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1562124 ]

        LUCENE-5409: fix ToParentBlockJoinQuery.rewrite to correctly preserve the origChildQuery after more than one iteration of rewrite

        Show
        ASF subversion and git services added a comment - Commit 1562124 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1562124 ] LUCENE-5409 : fix ToParentBlockJoinQuery.rewrite to correctly preserve the origChildQuery after more than one iteration of rewrite
        Hide
        ASF subversion and git services added a comment -

        Commit 1562125 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1562125 ]

        LUCENE-5409: fix ToParentBlockJoinQuery.rewrite to correctly preserve the origChildQuery after more than one iteration of rewrite

        Show
        ASF subversion and git services added a comment - Commit 1562125 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1562125 ] LUCENE-5409 : fix ToParentBlockJoinQuery.rewrite to correctly preserve the origChildQuery after more than one iteration of rewrite
        Hide
        Michael McCandless added a comment -

        Thanks Peng, I just committed this. I tweaked the test a bit to reduce the number of indexed docs ...

        Show
        Michael McCandless added a comment - Thanks Peng, I just committed this. I tweaked the test a bit to reduce the number of indexed docs ...
        Hide
        Peng Cheng added a comment -

        Pleasure to be of service, also thanks Aaron Colak and Long Yao at sciencescape.net for support.
        Do I need a test case to resolve LUCENE-5398?

        Show
        Peng Cheng added a comment - Pleasure to be of service, also thanks Aaron Colak and Long Yao at sciencescape.net for support. Do I need a test case to resolve LUCENE-5398 ?

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Peng Cheng
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 168h
              168h
              Remaining:
              Remaining Estimate - 168h
              168h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development