|
Patch with a fix.
Problem was in the logic for advancing PhrasePositions that were pointing to exactly the same term in the document. As a side effect of this fix sorting of the "repeats" array (at scorer initialization) is no longer required. Grant's test is also in the patch, slightly modified. Grant, can you give it a try and report here? There is a potential NPE. I am adding a patch (which should apply on top of the previous patch) that adds a test case and possible fix.
I will continue testing this. Thanks for looking at it, Doron. You're right! And thanks for cacthing this!
NPE is possible and I see it too with the new test. I think the suggested new fix is likely to miss additional matches in the same doc. I'll later post a test that shows this. (As a side comment, its useful to post patches named "LUCENE-NNN.patch" where NNN is the issue number. Updated patch with a fix for the NPE and with a test that fails with the previous fix for the NPE.
The point is to switch to the pp with higher (query) offset in case two pps are in the same (doc) position. Previous patch might restore the queue wrongly - pop pp but put pp2.
This patch fixes that by returning the correct pp into the pq. However it is yet not perfect since the one pp returned to pq might not be the last one advanced. This means pq could be sorted incorrectly with regard to repeating terms. I didn't manage to create a test case that fails due to this - testDoc4_Query3_All_Slops_Should_match in the test was the last trial to catch this. The only perfect solution I see is to re-populate the queue when this happens but this is costly and I tend not to do it. Open for suggestions... Updated patch fixes this issue.
In case of repeating terms in the query, this might be slower than previous patch, but it is supposedly correct in all cases while the previous one was not guaranteed to be always correct. There are no performance implications for the more common case of no repeating terms in the query. I plan to commit this in a day or two. termPositionsDiffer can return the PhrasePositions that was passed in (pp), which means you could be passing the same PhrasePositions to flip in both arguments. But flip assumes that the arguments are different. This can result in an ArrayIndexOutOfBoundsException in flip. Perhaps just checking that pp2 != pp on line 76 would be sufficient to avoid this.
I have not been able to come up with a simple test case that triggers this. I have a complex one, but it uses a custom Analyzer. You're right, Grant, good catch, thanks!
(I'm sure had this check but lost it when refactoring flipping to a method) Updated patch to follow. Updated patch with required check (pp!=pp2) before flipping as Grant suggested.
Committed to trunk.
I was wondering if this should be backported to 2.3.1 and/or 2.3.2. Fixed, thanks for the review Grant!
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I think I see where the problem is - working on it.