Issue Details (XML | Word | Printable)

Key: LUCENE-1310
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Doron Cohen
Reporter: Grant Glouser
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Phrase query with term repeated 3 times requires more slop than expected

Created: 20/Jun/08 03:48 AM   Updated: 05/Aug/08 07:44 AM
Return to search
Component/s: Search
Affects Version/s: 2.3.1, 2.3.2
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1310.1.patch 2008-06-27 03:57 AM Grant Glouser 2 kB
Text File Licensed for inclusion in ASF works LUCENE-1310.patch 2008-07-01 04:01 AM Doron Cohen 13 kB
Text File Licensed for inclusion in ASF works LUCENE-1310.patch 2008-06-30 03:56 PM Doron Cohen 13 kB
Text File Licensed for inclusion in ASF works LUCENE-1310.patch 2008-06-28 08:19 PM Doron Cohen 12 kB
Text File Licensed for inclusion in ASF works LUCENE-1310.patch 2008-06-28 04:29 PM Doron Cohen 11 kB
Text File Licensed for inclusion in ASF works LUCENE-1310.patch 2008-06-26 06:05 PM Doron Cohen 10 kB
Text File Licensed for inclusion in ASF works LUCENE-2.3.1-1310.patch 2008-07-01 04:28 AM Doron Cohen 13 kB
Java Source File Licensed for inclusion in ASF works TestSloppyPhraseQuery.java 2008-06-20 03:50 AM Grant Glouser 3 kB

Lucene Fields: New
Resolution Date: 05/Aug/08 07:44 AM


 Description  « Hide
Consider a document with the text "A A A".
The phrase query "A A A" (exact match) succeeds.
The query "A A A"~1 (same document and query, just increasing the slop value by one) fails.
"A A A"~2 succeeds again.

If the exact match succeeds, I wouldn't expect the same query but with more slop to fail. The fault seems to require some term to be repeated at least three times in the query, but the three occurrences do not need to be adjacent. I will attach a file that contains a set of JUnit tests that demonstrate what I mean.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Doron Cohen added a comment - 26/Jun/08 04:28 PM
I am too getting the error with the test. It is a bug indeed.
I think I see where the problem is - working on it.

Doron Cohen added a comment - 26/Jun/08 06:05 PM
Patch with a fix.

Problem was in the logic for advancing PhrasePositions that were pointing to exactly the same term in the document.
Note that this advancing is required to avoid false matches (as was fixed in LUCENE-736).
However must first advance the PhrasePosition whose offset (in the query) is the highest.

As a side effect of this fix sorting of the "repeats" array (at scorer initialization) is no longer required.

Grant's test is also in the patch, slightly modified.

Grant, can you give it a try and report here?


Grant Glouser added a comment - 27/Jun/08 03:57 AM
There is a potential NPE. I am adding a patch (which should apply on top of the previous patch) that adds a test case and possible fix.

I will continue testing this. Thanks for looking at it, Doron.


Doron Cohen added a comment - 28/Jun/08 04:04 PM
You're right! And thanks for cacthing this!
NPE is possible and I see it too with the new test.
I think the suggested new fix is likely to miss additional matches in the same doc.
I'll later post a test that shows this.

(As a side comment, its useful to post patches named "LUCENE-NNN.patch" where NNN is the issue number.
This way JIRA shows which fix is the most recent very clearly, and also, being a complete patch, everyone can easily apply the entire patch.
More on this in the Wiki under HowToContribute.)


Doron Cohen added a comment - 28/Jun/08 04:29 PM
Updated patch with a fix for the NPE and with a test that fails with the previous fix for the NPE.
The point is to switch to the pp with higher (query) offset in case two pps are in the same (doc) position.

Doron Cohen added a comment - 28/Jun/08 08:19 PM
Previous patch might restore the queue wrongly - pop pp but put pp2.
This patch fixes that by returning the correct pp into the pq.
However it is yet not perfect since the one pp returned to pq might not be the last one advanced.
This means pq could be sorted incorrectly with regard to repeating terms.
I didn't manage to create a test case that fails due to this - testDoc4_Query3_All_Slops_Should_match in the test was the last trial to catch this.
The only perfect solution I see is to re-populate the queue when this happens but this is costly and I tend not to do it.
Open for suggestions...

Doron Cohen added a comment - 30/Jun/08 03:56 PM
Updated patch fixes this issue.
In case of repeating terms in the query, this might be slower than previous patch, but it is supposedly correct in all cases while the previous one was not guaranteed to be always correct. There are no performance implications for the more common case of no repeating terms in the query.
I plan to commit this in a day or two.

Grant Glouser added a comment - 30/Jun/08 11:47 PM
termPositionsDiffer can return the PhrasePositions that was passed in (pp), which means you could be passing the same PhrasePositions to flip in both arguments. But flip assumes that the arguments are different. This can result in an ArrayIndexOutOfBoundsException in flip. Perhaps just checking that pp2 != pp on line 76 would be sufficient to avoid this.

I have not been able to come up with a simple test case that triggers this. I have a complex one, but it uses a custom Analyzer.


Doron Cohen added a comment - 01/Jul/08 03:56 AM
You're right, Grant, good catch, thanks!
(I'm sure had this check but lost it when refactoring flipping to a method)
Updated patch to follow.

Doron Cohen added a comment - 01/Jul/08 04:01 AM
Updated patch with required check (pp!=pp2) before flipping as Grant suggested.

Doron Cohen added a comment - 01/Jul/08 04:28 AM
same patch but for 2.3.1.

Doron Cohen added a comment - 03/Aug/08 03:54 PM
Committed to trunk.

I was wondering if this should be backported to 2.3.1 and/or 2.3.2.
It is not a major bug so I think not, though it might be critical to some application.
Didn't find a guideline for this in the wiki but I may be missing it?


Doron Cohen added a comment - 05/Aug/08 07:44 AM
Fixed, thanks for the review Grant!