OK I think I have a fix for this.
While looking at it, I realized that PhraseScorer (the one that used to base both Exact&Sloppy phrase scorers but now is the base of only sloppy phrase scorer) is way too complicated and inefficient. All those sort calls after each matching doc can be avoided.
So I am modifying PhraseScorer to not have a phrase-queue at all - just the sorted linked list, which is always kept sorted by advancing last beyond first. Last is renamed to 'min' and first is renamed to 'max'. Making the list cyclic allows more efficient manipulation of it.
With this, SloppyPhraseScorer is modified to maintain its own phrase queue. The queue size is set at the first candidate document. In order to handle repetitions (Same term in different query offsets) it will contain only some of the pps: those that either have no repetitions, or are the first (lower query offset) in a repeating group. A linked list of repeating pps was added: so PhrasePositions has a new member: nextRepeating.
Detection of repeating pps and creation of that list is done once per scorer: at the first candidate doc.
For solving the bugs reported here, in addition to the initiation of 'end' as explained in previous comment, advanceRepeatingPPs now also update two values:
- end, in case one of the repeating pps is far ahead (larger)
- position of the first pp in a repeating list (the one that is in the queue - in case the repeating pp is far behind (smaller). This can happen when there are holes in the query, as position = tpPOs - offset. It fixes the problem of false negative distances which caused this bug. It is tricky: relies on that PhrasePositions.nextPosition() ignores pp.position and just call positions.nextPosition(). But it is correct, as the modified position is used to replace pp in the queue.
Last, I think that the test added with holes had one wrong assert: It added four docs:
- drug drug
- drug druggy drug
- drug druggy druggy drug
- drug druggy drug druggy drug
defined this query (number is the offset):
and expected that with slop=1 the first doc would not be found.
I think it should be found, as the slop operates in both directions.
So modified the query to: drug(1) drug(3)
Patch to follow.