Issue Details (XML | Word | Printable)

Key: LUCENE-736
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Doron Cohen
Reporter: Doron Cohen
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Sloppy Phrase Scorer matches the doc "A B C D E" for query = "B C B"~2

Created: 01/Dec/06 09:13 AM   Updated: 19/Jun/07 08:14 AM
Return to search
Component/s: Search
Affects Version/s: None
Fix Version/s: 2.2

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works perf-search-new.log 2006-12-01 10:24 AM Doron Cohen 184 kB
Text File Licensed for inclusion in ASF works perf-search-orig.log 2006-12-01 10:24 AM Doron Cohen 184 kB
Text File Licensed for inclusion in ASF works res-search-new2.log 2006-12-02 08:33 AM Doron Cohen 363 kB
Text File Licensed for inclusion in ASF works res-search-orig2.log 2006-12-02 08:33 AM Doron Cohen 364 kB
Text File Licensed for inclusion in ASF works sloppy_phrase.patch2.txt 2006-12-02 08:33 AM Doron Cohen 25 kB
Text File Licensed for inclusion in ASF works sloppy_phrase.patch3.txt 2007-04-24 05:39 AM Doron Cohen 30 kB
Text File Licensed for inclusion in ASF works sloppy_phrase.patch3.txt 2006-12-04 09:49 PM Doron Cohen 27 kB
Text File Licensed for inclusion in ASF works sloppy_phrase_java.patch.txt 2006-12-01 10:24 AM Doron Cohen 12 kB
Text File Licensed for inclusion in ASF works sloppy_phrase_tests.patch.txt 2006-12-01 09:24 AM Doron Cohen 12 kB
Issue Links:
Reference
 

Lucene Fields: Patch Available
Resolution Date: 24/Apr/07 05:37 AM


 Description  « Hide
This is an extension of https://issues.apache.org/jira/browse/LUCENE-697

In addition to abnormalities Yonik pointed out in 697, there seem to be other issues with slopy phrase search and scoring.

1) A phrase with a repeated word would be detected in a document although it is not there.
I.e. document = A B D C E , query = "B C B" would not find this document (as expected), but query "B C B"~2 would find it.
I think that no matter how large the slop is, this document should not be a match.

2) A document containing both orders of a query, symmetrically, would score differently for the queru and for its reveresed form.
I.e. document = A B C B A would score differently for queries "B C"~2 and "C B"~2, although it is symmetric to both.

I will attach test cases that show both these problems and the one reported by Yonik in 697.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Doron Cohen added a comment - 01/Dec/06 09:24 AM
sloppy_phrase_tests.patch.txt contains:
  • two test cases added in TestPhraseQuery.
    These new tests currently fail.
  • skipTo() behavior tests that were originaly in issue 697.
    This too currently fails.

Doron Cohen added a comment - 01/Dec/06 10:24 AM
Attached sloppy_phrase_java.patch.txt is fixing the failing new tests.
This also includes the skipTo() bug from issue 697.

The fix does not guarantee that document A B C B A would score "A B C"~4 and "C B A"~4 the same.
It does that for "B C"~2 and "C B"~2.
This is because a general fix for that (at least the one that I devised) would be too expensive.
Although this is an interesting case, I'd like to think it is not an important one.

This fix comes with a performance cost: about 15% degradation in CPU activity of sloppy phrase scoring, as the attcahed perf logs show.
Here is the summary of these tests:

.......Operation..........runCnt...recsPerRun.....rec/s..elapsedSec
Orig:..SearchSameRdr_3000......4.........3000.....216.1.......55.52
New:...SearchSameRdr_3000......4.........3000.....187.8.......63.91

I think that in a real life scenario - real index, real documents, real queries - this extra CPU will be shaded by IO, but I also belive we should refrain from slowing down search, so, unhappy with this degradation (anyone would, I would look for a other ways to fix this - ideas are welcome.

Perf test was done using the task benchmark framework (see issue 675), The logs show also the queries that were searched.

All tests pass with new code.


Yonik Seeley added a comment - 01/Dec/06 05:04 PM
Great investigations Doron!
Personally I'm more concerned with (1) than (2). Was the fix for one issue more responsible for the performance loss than the other?

Paul Elschot added a comment - 01/Dec/06 07:02 PM
I have had similar concerns when I implemented NearSpansOrdered.java and NearSpansUnordered.java,
which are in the trunk now.
These match somewhat different phrases, but it would be good to ensure that the same matches score
the same for spans and phrases.

Doron Cohen added a comment - 02/Dec/06 08:33 AM
The change to fix case 2 was not the main performance degradation cause.

I agree with Yonik that case 2 is much more important than case 1.
So I modified to fix case 2 but not case 1.
Also extended the perf test to create also the "reversed" form of the sloppy phrases (slop increased for reversed cases so that queries would match docs).

Cost of this fix dropped from 15% more CPU time to about 3%.
I feel ok with this.

.....Operation..........runCnt...recsPerRun...rec/s..elapsedSec....avgUsedMem....avgTotalMem
Orig.SearchSameRdr_6000......4.........6000...194.2......123.59.....8,032,732.....11,333,632
New..SearchSameRdr_6000......4.........6000...187.5......128.02.....8,172,258.....11,333,632

Attached sloppy_phrase.patch2.txt has the updated fix, including both java and test parts. Some of the asserts in the new tests were commented out cause the patch takes decision not to fix case 1 above.

Also attaching the updates perf test logs - res-search-orig2.log and res-search-new2.log.

I did not compare scoring of similar cases between sloppy phrase and near spans and Paul suggested - perhaps next week - not sure this should hold progress with this issue.


Paul Elschot added a comment - 02/Dec/06 09:46 AM
There is no need to hold up this issue for span phrases.
Perhaps a good way to get the spans and the phrases work well together is by adding a getSpans() to
PhraseQuery, or by introduction of a SpanPhraseQuery. But this would better be done at a new jira issue.

Doron Cohen added a comment - 04/Dec/06 09:08 PM
There is a bug in my recent patch (sloppy_phrase.patch2.txt):
  • for the case of phrase with repetitions, some additional computation is required before starting each doc.
  • this does not affect the regular/common case of phrase with no repetitions.
    I extended the test to expose this and will commit an updated patch later today.

Doron Cohen added a comment - 04/Dec/06 09:49 PM
Test case - testNonExistingWrappedPhrase - was extended.
A bug in the patch (described above) was fixed.

All tests pass.


Otis Gospodnetic added a comment - 16/Apr/07 06:23 PM
Doron, sounds like this is ripe for a commit now to take care of both this and LUCENE-697.

Doron Cohen added a comment - 19/Apr/07 05:23 AM
Need to see if the parts of the test (in QueryUtils) that were disabled by LUCENE-730 (BooleanScorer2 sometimes falls back to BooleanScorer). One possibility is to have two versions of this - a BooleanScoere version, and the rest - this issue (736) is about sloppy/exact phrase scoring, so it would fall into the "rest", and so the test would still catch this.

Doron Cohen added a comment - 24/Apr/07 05:15 AM
Changing the title to match what we decided to fix here.

Doron Cohen added a comment - 24/Apr/07 05:37 AM
Fixed.
  • Search time cost of this fix is 4%, for sloppy phrase search.
  • This fix also partially brings back the the tests checkSkipTo() in QueryUtils,
    (which was disabled by LUCENE-730), but this is now invoked only for non Boolean scorers.

Doron Cohen added a comment - 24/Apr/07 05:39 AM
Attaching for any future reference the fix that was applied for this.