Lucene - Core: LUCENE-3412

SloppyPhraseScorer returns non-deterministic results for queries with many repeats

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 3.2, 3.3, 4.0-ALPHA
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None

      Description

      Proximity queries with many repeats (four or more, based on my testing) return non-deterministic results. I run the same query multiple times with the same data set and get different results.

      So far I've reproduced this with Solr 1.4.1, 3.1, 3.2, 3.3, and latest 4.0 trunk.

      Steps to reproduce (using the Solr example):
      1) In solrconfig.xml, set queryResultCache size to 0.
      2) Add some documents with text "dog dog dog" and "dog dog dog dog". http://localhost:8983/solr/update?stream.body=%3Cadd%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3E1%3C/field%3E%3Cfield%20name=%22text%22%3Edog%20dog%20dog%3C/field%3E%3C/doc%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3E2%3C/field%3E%3Cfield%20name=%22text%22%3Edog%20dog%20dog%20dog%3C/field%3E%3C/doc%3E%3C/add%3E&commit=true
      3) Do a "dog dog dog dog"~1 query. http://localhost:8983/solr/select?q=%22dog%20dog%20dog%20dog%22~1
      4) Repeat step 3 many times.

      Expected results: Only the document with id 2 should be returned.

      Actual results: The document with id 2 is always returned. The document with id 1 is sometimes returned.

      Different proximity values show the same bug - "dog dog dog dog"~5, "dog dog dog dog"~100, etc.

      So far I've traced it down to the "repeats" array in SloppyPhraseScorer.initPhrasePositions() - depending on the order of the elements in this array, the document may or may not match. I think the HashSet may be to blame, but I'm not sure - that at least seems to be where the non-determinism is coming from.
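The suspicion about the HashSet is plausible: in Java, a class that does not override hashCode() falls back to identity hash codes, which differ from run to run, so HashSet iteration order over such objects is not stable across JVM runs. The sketch below is a hypothetical simplification (the PP class and its offset field are illustrative stand-ins, not the actual Lucene code) showing how sorting by a stable key restores a deterministic processing order:

```java
import java.util.*;

public class HashSetOrderDemo {
    // Hypothetical stand-in for a phrase-position object; like the real
    // one, it does not override hashCode(), so identity hash codes
    // (which vary between JVM runs) decide its placement in a HashSet.
    static class PP {
        final int offset;
        PP(int offset) { this.offset = offset; }
    }

    public static void main(String[] args) {
        Set<PP> repeats = new HashSet<>();
        for (int off : new int[] {3, 0, 2, 1}) {
            repeats.add(new PP(off));
        }

        // Iteration order here is unspecified and may differ across runs.
        System.out.print("HashSet order: ");
        for (PP pp : repeats) System.out.print(pp.offset + " ");
        System.out.println();

        // Sorting by offset yields the same order on every run,
        // regardless of how the HashSet happened to arrange its elements.
        List<PP> ordered = new ArrayList<>(repeats);
        ordered.sort(Comparator.comparingInt(pp -> pp.offset));
        for (int i = 0; i < ordered.size(); i++) {
            if (ordered.get(i).offset != i) throw new AssertionError("not in offset order");
        }
        System.out.println("sorted by offset: deterministic");
    }
}
```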

      1. LUCENE-3412.patch
        7 kB
        Doron Cohen
      2. LUCENE-3412.patch
        2 kB
        Doron Cohen

        Activity

        Robert Muir added a comment -

        This issue could also be related to LUCENE-3215: in some cases with repeats, sloppy phrasescorer returns scores of Infinity... what scores are you getting?

        However, I don't think it's a duplicate issue; with LUCENE-3215 the issue is when you have sloppyphrasequery + repeats + positionIncrements > 1 (e.g. stopwords and enablePositionIncrements=true, the default)

        Michael Ryan added a comment -

        Here's the debugQuery output from when it matched both docs:

        <lst name="explain"><str name="2">
        1.1890696 = (MATCH) weight(text:"dog dog dog dog"~1 in 1) [DefaultSimilarity], result of:
          1.1890696 = score(doc=1,freq=1.0 = phraseFreq=1.0
        ), product of:
            0.99999994 = queryWeight, product of:
              2.3781395 = idf(), sum of:
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
              0.42049676 = queryNorm
            1.1890697 = fieldWeight in 1, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = phraseFreq=1.0
              2.3781395 = idf(), sum of:
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=1)
        </str><str name="1">
        0.8407992 = (MATCH) weight(text:"dog dog dog dog"~1 in 0) [DefaultSimilarity], result of:
          0.8407992 = score(doc=0,freq=0.5 = phraseFreq=0.5
        ), product of:
            0.99999994 = queryWeight, product of:
              2.3781395 = idf(), sum of:
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
              0.42049676 = queryNorm
            0.8407993 = fieldWeight in 0, product of:
              0.70710677 = tf(freq=0.5), with freq of:
                0.5 = phraseFreq=0.5
              2.3781395 = idf(), sum of:
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=0)
        </str></lst>
        

        Sometimes when it matches both docs I'll get "no matching term" for the second one:

        <lst name="explain"><str name="2">
        1.1890696 = (MATCH) weight(text:"dog dog dog dog"~1 in 1) [DefaultSimilarity], result of:
          1.1890696 = score(doc=1,freq=1.0 = phraseFreq=1.0
        ), product of:
            0.99999994 = queryWeight, product of:
              2.3781395 = idf(), sum of:
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
              0.42049676 = queryNorm
            1.1890697 = fieldWeight in 1, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = phraseFreq=1.0
              2.3781395 = idf(), sum of:
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
                0.5945349 = idf(docFreq=2, maxDocs=2)
              0.5 = fieldNorm(doc=1)
        </str><str name="1">
        0.0 = (NON-MATCH) no matching term
        </str></lst>
        
        Doron Cohen added a comment -

        I am able to see this inconsistent behavior!

        Attached patch contains a test that fails on this. The test currently prints the trial number. The first loop always passes in all 30 trials (expected), while the second loop always fails (for me) but is inconsistent about when it fails. Sometimes it fails on the first iteration; other times on the 3rd, 9th, etc.

        Quite peculiar... investigating...

        Doron Cohen added a comment -

        Attached patch with fix to this bug.

        The fix is rather simple: just process PPs in offset order. That is, when avoiding conflicts (a conflict means more than a single query PP landing on the same doc TP), make sure to handle PPs in a specific order: from first in the query to last in the query.

        This is crucial because the check for conflicts returns the PP with greater offset, and that one is advanced.
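The ordering requirement can be sketched as follows. This is a hypothetical simplification, not the actual Lucene code: the PP class, the findConflict method, and their fields are illustrative names only. The point is that when two PPs land on the same document position, the one with the greater query offset is the one returned for advancing, which only behaves consistently if PPs are examined first-to-last:

```java
import java.util.*;

public class ConflictOrderSketch {
    // Hypothetical, simplified stand-in for a phrase-position object.
    static class PP {
        final int offset;   // position of this term in the query
        final int position; // current term position in the document
        PP(int offset, int position) { this.offset = offset; this.position = position; }
    }

    // A "conflict" is two PPs landing on the same document position.
    // Returns the PP with the greater query offset (the one to advance),
    // or null if there is no conflict. The input MUST be in offset order
    // for the result to be deterministic.
    static PP findConflict(List<PP> pps) {
        Map<Integer, PP> seenAtPosition = new HashMap<>();
        for (PP pp : pps) {
            PP prev = seenAtPosition.putIfAbsent(pp.position, pp);
            if (prev != null) {
                return pp.offset > prev.offset ? pp : prev;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<PP> pps = new ArrayList<>(Arrays.asList(
            new PP(0, 5), new PP(1, 5), new PP(2, 6)));
        // Deterministic processing order: first in query to last in query.
        pps.sort(Comparator.comparingInt(pp -> pp.offset));
        PP toAdvance = findConflict(pps);
        if (toAdvance == null || toAdvance.offset != 1) {
            throw new AssertionError("expected the PP with the greater offset (1) to be advanced");
        }
        System.out.println("advance PP with offset " + toAdvance.offset);
    }
}
```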

        It was pretty quick to fix this, but it took longer to justify the fix.

        I added some explanations in the code so that next time justification would be faster and also renamed termPositionsDiffer() to termPositionsConflict() which more accurately describes the logic of that method.

        Now I need to see whether this fix is also related to LUCENE-3215.

        Michael Ryan added a comment -

        Thanks, Doron. I've tried applying your patch to Solr 3.2 and it is working well so far.

        Doron Cohen added a comment -

        Thanks Michael for verifying this, I'll go ahead and commit.

        Doron Cohen added a comment -

        Fix committed:

        • r1166541 - trunk
        • r1166563 - 3x

        (fix not included in 3.4 RC, therefore marked as 3.5 above)

        Uwe Schindler added a comment -

        Bulk close after release of 3.5


          People

          • Assignee:
            Doron Cohen
          • Reporter:
            Michael Ryan