Lucene - Core
  1. Lucene - Core
  2. LUCENE-3068

The repeats mechanism in SloppyPhraseScorer is broken when doc has tokens at same position

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.3, 3.1, 4.0-ALPHA
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-736 we made fixes to SloppyPhraseScorer, because it was
      matching docs that it shouldn't; but I think those changes caused it
      to fail to match docs that it should, specifically when the doc itself
      has tokens at the same position.

      1. LUCENE-3068.patch
        11 kB
        Doron Cohen
      2. LUCENE-3068.patch
        9 kB
        Doron Cohen
      3. LUCENE-3068.patch
        3 kB
        Doron Cohen
      4. LUCENE-3068.patch
        2 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch w/ test case showing the problem.

        If you set slop to 0 for the PhraseQuery, the test passes. The MultiPhraseQuery passes with slop or no slop because it handles the same-position case itself (Union*Enum).

        That got me thinking... maybe any time a *PhraseQuery has overlapping positions, we should rewrite to a MultiPhraseQuery and let it handle the same positions...? Is there any downside to that?

        Show
        Michael McCandless added a comment - Patch w/ test case showing the problem. If you set slop to 0 for the PhraseQuery, the test passes. The MultiPhraseQuery passes with slop or no slop because it handles the same-position case itself (Union*Enum). That got me thinking... maybe any time a *PhraseQuery has overlapping positions, we should rewrite to a MultiPhraseQuery and let it handle the same positions...? Is there any downside to that?
        Hide
        Doron Cohen added a comment -

        specifically when the doc itself has tokens at the same position.

        I am not convinced yet that there is a bug here - I think the code does allow this?

        There is another assumption in the code, that any two different PPs are in different TPs - which underlines the assumption that originally each PP differs in position, This seems a valid assumption, because QP will create MFQ if there are two terms in the (phrase) query with same position.

        maybe any time a *PhraseQuery has overlapping positions, we should rewrite to a MultiPhraseQuery and let it handle the same positions...? Is there any downside to that?

        I think this is the correct behavior - in particular this will be the query that a QP will create. The only way to create a PQ (not MPQ) for PPs in same positions is to create it manually. But why would anyone do that? And they did, wouldn't such a rewrite be a surprise to them?

        A patch to follow with a revised version of this test - one that uses the QP. In this patch the QP indeed creates an MFQ, and I am yet unable to make it fail. Still trying.

        Show
        Doron Cohen added a comment - specifically when the doc itself has tokens at the same position. I am not convinced yet that there is a bug here - I think the code does allow this? There is another assumption in the code, that any two different PPs are in different TPs - which underlines the assumption that originally each PP differs in position, This seems a valid assumption, because QP will create MFQ if there are two terms in the (phrase) query with same position. maybe any time a *PhraseQuery has overlapping positions, we should rewrite to a MultiPhraseQuery and let it handle the same positions...? Is there any downside to that? I think this is the correct behavior - in particular this will be the query that a QP will create. The only way to create a PQ (not MPQ) for PPs in same positions is to create it manually. But why would anyone do that? And they did, wouldn't such a rewrite be a surprise to them? A patch to follow with a revised version of this test - one that uses the QP. In this patch the QP indeed creates an MFQ, and I am yet unable to make it fail. Still trying.
        Hide
        Doron Cohen added a comment -

        Attached modified version of the test - one that invokes the query parser to create an MFQ. The test passes.

        Show
        Doron Cohen added a comment - Attached modified version of the test - one that invokes the query parser to create an MFQ. The test passes.
        Hide
        Doron Cohen added a comment -

        This is more complex than I originally thought.

        1. QueryParser creates a MultiplePhraseQuery (MPQ) when one of the (phrase) query positions is a multi-term.
        2. MPQ has an implicit OR behavior - it is used for e.g. wildcarding a phrase query.
        3. PhraseQuery (PQ) sloppy scorer assumes each query position has a single term.
        4. PQ with several terms in same position cannot be created by parsing it with a QP, only manual.
          Manually created, it would have an AND semantics: only docs with ALL the terms in pos N should match.
          In other words, assume doc D terms and positions are:
          a:0 b:1 c:1 d:2
          MPQ for (a,b):0 d:1 should match D, finding the phrase b:1 d:2 (OR semantics)
          PQ for (a,b):0 d:1 should not match D, because it does not contain 'a' and 'b' in the same position (AND semantics).

        Therefore, rewriting PQ into MPQ is not a valid fix, because it would turn the AND logic assumed by creating the PQ this way, by an OR logic as assumed in MPQ.

        TestPositionIncrement.testSetPosition has a test for this case exactly
            // phrase query should fail for non existing searched term 
            // even if there exist another searched terms in the same searched position. 
            q = new PhraseQuery();
            q.add(new Term("field", "3"),0);
            q.add(new Term("field", "9"),0);
            hits = searcher.search(q, null, 1000).scoreDocs;
            assertEquals(0, hits.length);
        

        Although QP by default will not create this PQ, I think we need to support it, for applications needing to be strict with the search results, with slop.

        So fixing this would need to take place inside SloppyScorer, digging further...

        Show
        Doron Cohen added a comment - This is more complex than I originally thought. QueryParser creates a MultiplePhraseQuery (MPQ) when one of the (phrase) query positions is a multi-term. MPQ has an implicit OR behavior - it is used for e.g. wildcarding a phrase query. PhraseQuery (PQ) sloppy scorer assumes each query position has a single term. PQ with several terms in same position cannot be created by parsing it with a QP, only manual. Manually created, it would have an AND semantics: only docs with ALL the terms in pos N should match. In other words, assume doc D terms and positions are: a:0 b:1 c:1 d:2 MPQ for (a,b):0 d:1 should match D, finding the phrase b:1 d:2 (OR semantics) PQ for (a,b):0 d:1 should not match D, because it does not contain 'a' and 'b' in the same position (AND semantics). Therefore, rewriting PQ into MPQ is not a valid fix, because it would turn the AND logic assumed by creating the PQ this way, by an OR logic as assumed in MPQ. TestPositionIncrement.testSetPosition has a test for this case exactly // phrase query should fail for non existing searched term // even if there exist another searched terms in the same searched position. q = new PhraseQuery(); q.add( new Term( "field" , "3" ),0); q.add( new Term( "field" , "9" ),0); hits = searcher.search(q, null , 1000).scoreDocs; assertEquals(0, hits.length); Although QP by default will not create this PQ, I think we need to support it, for applications needing to be strict with the search results, with slop. So fixing this would need to take place inside SloppyScorer, digging further...
        Hide
        Doron Cohen added a comment -

        Attached patch fixes this bug by excluding fro the repeats check those PPs originated fro same offset in the query.

        This allows more strict phrase queries: strict on terms in same position (AND logic) but still sloppy.

        All tests pass, this is ready to go in (unless there are reservations).

        Show
        Doron Cohen added a comment - Attached patch fixes this bug by excluding fro the repeats check those PPs originated fro same offset in the query. This allows more strict phrase queries: strict on terms in same position (AND logic) but still sloppy. All tests pass, this is ready to go in (unless there are reservations).
        Hide
        Shai Erera added a comment -

        Patch looks good to me.

        One comment about the test - perhaps use the LTC methods that do random tests, like newDirectory(), newIndexWriterConfig() etc.? If you don't think it's appropriate for this test, that's ok with me.

        Show
        Shai Erera added a comment - Patch looks good to me. One comment about the test - perhaps use the LTC methods that do random tests, like newDirectory(), newIndexWriterConfig() etc.? If you don't think it's appropriate for this test, that's ok with me.
        Hide
        Doron Cohen added a comment -

        Thanks for reviewing Shai!
        I'll updated the patch with random newDirectory and newICFG - not the focus here, but may improve coverage anyhow,
        I added tests for the combined case - some AND some OR - that is, using MPQ, some add() with a single term (AND), some with an array longer than 1 (OR). Also refactored the tests a bit so that now there's a small test method for each test case.

        Show
        Doron Cohen added a comment - Thanks for reviewing Shai! I'll updated the patch with random newDirectory and newICFG - not the focus here, but may improve coverage anyhow, I added tests for the combined case - some AND some OR - that is, using MPQ, some add() with a single term (AND), some with an array longer than 1 (OR). Also refactored the tests a bit so that now there's a small test method for each test case.
        Hide
        Doron Cohen added a comment -

        Patch with more test cases - AND/OR logic for MPQ is combined, and test code made simpler.

        Show
        Doron Cohen added a comment - Patch with more test cases - AND/OR logic for MPQ is combined, and test code made simpler.
        Hide
        Doron Cohen added a comment -

        fixed in trunk in r1124293.

        Show
        Doron Cohen added a comment - fixed in trunk in r1124293.
        Hide
        Doron Cohen added a comment -

        fix merged to 3x in r1124302.

        Show
        Doron Cohen added a comment - fix merged to 3x in r1124302.
        Hide
        Doron Cohen added a comment -

        I wonder if this should be fixed also in 3.1 branch?
        Probably so only if we make a 3.1.1, but not needed if its gonna be a 3.2.
        What's the best practice then? Reopen until decision?
        Or rely on rescanning all 3.2 changes in case its gonna be 3.1.1?

        Show
        Doron Cohen added a comment - I wonder if this should be fixed also in 3.1 branch? Probably so only if we make a 3.1.1, but not needed if its gonna be a 3.2. What's the best practice then? Reopen until decision? Or rely on rescanning all 3.2 changes in case its gonna be 3.1.1?
        Hide
        Doron Cohen added a comment -

        Looking at http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html (Mike this is a great tool!) I see no particular slowdown at the last runs.

        A thought about these benchmarks, it would be helpful if the checked revision would be shown - perhaps as part of the hover text when hovering the mouse on a graph point...

        Show
        Doron Cohen added a comment - Looking at http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html (Mike this is a great tool!) I see no particular slowdown at the last runs. A thought about these benchmarks, it would be helpful if the checked revision would be shown - perhaps as part of the hover text when hovering the mouse on a graph point...
        Hide
        Michael McCandless added a comment -

        A thought about these benchmarks, it would be helpful if the checked revision would be shown - perhaps as part of the hover text when hovering the mouse on a graph point..

        Good idea! I'll try to do this...

        Note that if you go back to the root page, and click on a given day, it tells you the svn rev and also hg ref (of luceneutil), so that's a [cumbersome] way to get the svn rev.

        Show
        Michael McCandless added a comment - A thought about these benchmarks, it would be helpful if the checked revision would be shown - perhaps as part of the hover text when hovering the mouse on a graph point.. Good idea! I'll try to do this... Note that if you go back to the root page, and click on a given day, it tells you the svn rev and also hg ref (of luceneutil), so that's a [cumbersome] way to get the svn rev.
        Hide
        Doron Cohen added a comment -

        Note that if you go back to the root page, and click on a given day, it tells you the svn rev and also hg ref (of luceneutil)

        Great, thanks!

        So, this commit to trunk in r1124293 falls between these two:

        • Tue 17/05/2011 Lucene/Solr trunk rev 1104671
        • Wed 18/05/2011 Lucene/Solr trunk rev 1124524

        ... No measurable degradation, good!

        Show
        Doron Cohen added a comment - Note that if you go back to the root page, and click on a given day, it tells you the svn rev and also hg ref (of luceneutil) Great, thanks! So, this commit to trunk in r1124293 falls between these two: Tue 17/05/2011 Lucene/Solr trunk rev 1104671 Wed 18/05/2011 Lucene/Solr trunk rev 1124524 ... No measurable degradation, good!
        Hide
        Simon Willnauer added a comment -

        Looking at http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html (Mike this is a great tool!) I see no particular slowdown at the last runs.

        I love it! good that all the work on LuceneUtil pays off!!!!!

        Show
        Simon Willnauer added a comment - Looking at http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html (Mike this is a great tool!) I see no particular slowdown at the last runs. I love it! good that all the work on LuceneUtil pays off!!!!!
        Hide
        Robert Muir added a comment -

        Bulk closing for 3.2

        Show
        Robert Muir added a comment - Bulk closing for 3.2

          People

          • Assignee:
            Doron Cohen
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development