Lucene - Core
  1. Lucene - Core
  2. LUCENE-736

Sloppy Phrase Scorer matches the doc "A B C D E" for query = "B C B"~2

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      This is an extension of https://issues.apache.org/jira/browse/LUCENE-697

      In addition to abnormalities Yonik pointed out in 697, there seem to be other issues with slopy phrase search and scoring.

      1) A phrase with a repeated word would be detected in a document although it is not there.
      I.e. document = A B D C E , query = "B C B" would not find this document (as expected), but query "B C B"~2 would find it.
      I think that no matter how large the slop is, this document should not be a match.

      2) A document containing both orders of a query, symmetrically, would score differently for the queru and for its reveresed form.
      I.e. document = A B C B A would score differently for queries "B C"~2 and "C B"~2, although it is symmetric to both.

      I will attach test cases that show both these problems and the one reported by Yonik in 697.

      1. perf-search-new.log
        184 kB
        Doron Cohen
      2. perf-search-orig.log
        184 kB
        Doron Cohen
      3. res-search-new2.log
        363 kB
        Doron Cohen
      4. res-search-orig2.log
        364 kB
        Doron Cohen
      5. sloppy_phrase_java.patch.txt
        12 kB
        Doron Cohen
      6. sloppy_phrase_tests.patch.txt
        12 kB
        Doron Cohen
      7. sloppy_phrase.patch2.txt
        25 kB
        Doron Cohen
      8. sloppy_phrase.patch3.txt
        30 kB
        Doron Cohen
      9. sloppy_phrase.patch3.txt
        27 kB
        Doron Cohen

        Issue Links

          Activity

          Hide
          Doron Cohen added a comment -

          sloppy_phrase_tests.patch.txt contains:

          • two test cases added in TestPhraseQuery.
            These new tests currently fail.
          • skipTo() behavior tests that were originaly in issue 697.
            This too currently fails.
          Show
          Doron Cohen added a comment - sloppy_phrase_tests.patch.txt contains: two test cases added in TestPhraseQuery. These new tests currently fail. skipTo() behavior tests that were originaly in issue 697. This too currently fails.
          Hide
          Doron Cohen added a comment -

          Attached sloppy_phrase_java.patch.txt is fixing the failing new tests.
          This also includes the skipTo() bug from issue 697.

          The fix does not guarantee that document A B C B A would score "A B C"~4 and "C B A"~4 the same.
          It does that for "B C"~2 and "C B"~2.
          This is because a general fix for that (at least the one that I devised) would be too expensive.
          Although this is an interesting case, I'd like to think it is not an important one.

          This fix comes with a performance cost: about 15% degradation in CPU activity of sloppy phrase scoring, as the attcahed perf logs show.
          Here is the summary of these tests:

          .......Operation..........runCnt...recsPerRun.....rec/s..elapsedSec
          Orig:..SearchSameRdr_3000......4.........3000.....216.1.......55.52
          New:...SearchSameRdr_3000......4.........3000.....187.8.......63.91

          I think that in a real life scenario - real index, real documents, real queries - this extra CPU will be shaded by IO, but I also belive we should refrain from slowing down search, so, unhappy with this degradation (anyone would, I would look for a other ways to fix this - ideas are welcome.

          Perf test was done using the task benchmark framework (see issue 675), The logs show also the queries that were searched.

          All tests pass with new code.

          Show
          Doron Cohen added a comment - Attached sloppy_phrase_java.patch.txt is fixing the failing new tests. This also includes the skipTo() bug from issue 697. The fix does not guarantee that document A B C B A would score "A B C"~4 and "C B A"~4 the same. It does that for "B C"~2 and "C B"~2. This is because a general fix for that (at least the one that I devised) would be too expensive. Although this is an interesting case, I'd like to think it is not an important one. This fix comes with a performance cost: about 15% degradation in CPU activity of sloppy phrase scoring, as the attcahed perf logs show. Here is the summary of these tests: .......Operation..........runCnt...recsPerRun.....rec/s..elapsedSec Orig:..SearchSameRdr_3000......4.........3000.....216.1.......55.52 New:...SearchSameRdr_3000......4.........3000.....187.8.......63.91 I think that in a real life scenario - real index, real documents, real queries - this extra CPU will be shaded by IO, but I also belive we should refrain from slowing down search, so, unhappy with this degradation (anyone would , I would look for a other ways to fix this - ideas are welcome. Perf test was done using the task benchmark framework (see issue 675), The logs show also the queries that were searched. All tests pass with new code.
          Hide
          Yonik Seeley added a comment -

          Great investigations Doron!
          Personally I'm more concerned with (1) than (2). Was the fix for one issue more responsible for the performance loss than the other?

          Show
          Yonik Seeley added a comment - Great investigations Doron! Personally I'm more concerned with (1) than (2). Was the fix for one issue more responsible for the performance loss than the other?
          Hide
          Paul Elschot added a comment -

          I have had similar concerns when I implemented NearSpansOrdered.java and NearSpansUnordered.java,
          which are in the trunk now.
          These match somewhat different phrases, but it would be good to ensure that the same matches score
          the same for spans and phrases.

          Show
          Paul Elschot added a comment - I have had similar concerns when I implemented NearSpansOrdered.java and NearSpansUnordered.java, which are in the trunk now. These match somewhat different phrases, but it would be good to ensure that the same matches score the same for spans and phrases.
          Hide
          Doron Cohen added a comment -

          The change to fix case 2 was not the main performance degradation cause.

          I agree with Yonik that case 2 is much more important than case 1.
          So I modified to fix case 2 but not case 1.
          Also extended the perf test to create also the "reversed" form of the sloppy phrases (slop increased for reversed cases so that queries would match docs).

          Cost of this fix dropped from 15% more CPU time to about 3%.
          I feel ok with this.

          .....Operation..........runCnt...recsPerRun...rec/s..elapsedSec....avgUsedMem....avgTotalMem
          Orig.SearchSameRdr_6000......4.........6000...194.2......123.59.....8,032,732.....11,333,632
          New..SearchSameRdr_6000......4.........6000...187.5......128.02.....8,172,258.....11,333,632

          Attached sloppy_phrase.patch2.txt has the updated fix, including both java and test parts. Some of the asserts in the new tests were commented out cause the patch takes decision not to fix case 1 above.

          Also attaching the updates perf test logs - res-search-orig2.log and res-search-new2.log.

          I did not compare scoring of similar cases between sloppy phrase and near spans and Paul suggested - perhaps next week - not sure this should hold progress with this issue.

          Show
          Doron Cohen added a comment - The change to fix case 2 was not the main performance degradation cause. I agree with Yonik that case 2 is much more important than case 1. So I modified to fix case 2 but not case 1. Also extended the perf test to create also the "reversed" form of the sloppy phrases (slop increased for reversed cases so that queries would match docs). Cost of this fix dropped from 15% more CPU time to about 3%. I feel ok with this. .....Operation..........runCnt...recsPerRun...rec/s..elapsedSec....avgUsedMem....avgTotalMem Orig.SearchSameRdr_6000......4.........6000...194.2......123.59.....8,032,732.....11,333,632 New..SearchSameRdr_6000......4.........6000...187.5......128.02.....8,172,258.....11,333,632 Attached sloppy_phrase.patch2.txt has the updated fix, including both java and test parts. Some of the asserts in the new tests were commented out cause the patch takes decision not to fix case 1 above. Also attaching the updates perf test logs - res-search-orig2.log and res-search-new2.log. I did not compare scoring of similar cases between sloppy phrase and near spans and Paul suggested - perhaps next week - not sure this should hold progress with this issue.
          Hide
          Paul Elschot added a comment -

          There is no need to hold up this issue for span phrases.
          Perhaps a good way to get the spans and the phrases work well together is by adding a getSpans() to
          PhraseQuery, or by introduction of a SpanPhraseQuery. But this would better be done at a new jira issue.

          Show
          Paul Elschot added a comment - There is no need to hold up this issue for span phrases. Perhaps a good way to get the spans and the phrases work well together is by adding a getSpans() to PhraseQuery, or by introduction of a SpanPhraseQuery. But this would better be done at a new jira issue.
          Hide
          Doron Cohen added a comment -

          There is a bug in my recent patch (sloppy_phrase.patch2.txt):

          • for the case of phrase with repetitions, some additional computation is required before starting each doc.
          • this does not affect the regular/common case of phrase with no repetitions.
            I extended the test to expose this and will commit an updated patch later today.
          Show
          Doron Cohen added a comment - There is a bug in my recent patch (sloppy_phrase.patch2.txt): for the case of phrase with repetitions, some additional computation is required before starting each doc. this does not affect the regular/common case of phrase with no repetitions. I extended the test to expose this and will commit an updated patch later today.
          Hide
          Doron Cohen added a comment -

          Test case - testNonExistingWrappedPhrase - was extended.
          A bug in the patch (described above) was fixed.

          All tests pass.

          Show
          Doron Cohen added a comment - Test case - testNonExistingWrappedPhrase - was extended. A bug in the patch (described above) was fixed. All tests pass.
          Hide
          Otis Gospodnetic added a comment -

          Doron, sounds like this is ripe for a commit now to take care of both this and LUCENE-697.

          Show
          Otis Gospodnetic added a comment - Doron, sounds like this is ripe for a commit now to take care of both this and LUCENE-697 .
          Hide
          Doron Cohen added a comment -

          Need to see if the parts of the test (in QueryUtils) that were disabled by LUCENE-730 (BooleanScorer2 sometimes falls back to BooleanScorer). One possibility is to have two versions of this - a BooleanScoere version, and the rest - this issue (736) is about sloppy/exact phrase scoring, so it would fall into the "rest", and so the test would still catch this.

          Show
          Doron Cohen added a comment - Need to see if the parts of the test (in QueryUtils) that were disabled by LUCENE-730 (BooleanScorer2 sometimes falls back to BooleanScorer). One possibility is to have two versions of this - a BooleanScoere version, and the rest - this issue (736) is about sloppy/exact phrase scoring, so it would fall into the "rest", and so the test would still catch this.
          Hide
          Doron Cohen added a comment -

          Changing the title to match what we decided to fix here.

          Show
          Doron Cohen added a comment - Changing the title to match what we decided to fix here.
          Hide
          Doron Cohen added a comment -

          Fixed.

          • Search time cost of this fix is 4%, for sloppy phrase search.
          • This fix also partially brings back the the tests checkSkipTo() in QueryUtils,
            (which was disabled by LUCENE-730), but this is now invoked only for non Boolean scorers.
          Show
          Doron Cohen added a comment - Fixed. Search time cost of this fix is 4%, for sloppy phrase search. This fix also partially brings back the the tests checkSkipTo() in QueryUtils, (which was disabled by LUCENE-730 ), but this is now invoked only for non Boolean scorers.
          Hide
          Doron Cohen added a comment -

          Attaching for any future reference the fix that was applied for this.

          Show
          Doron Cohen added a comment - Attaching for any future reference the fix that was applied for this.

            People

            • Assignee:
              Doron Cohen
              Reporter:
              Doron Cohen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development