Thanks Robert, okay, I'll continue with option 2 then.
In addition, perhaps we should try harder for a sloppy version of the current ExactPhraseScorer, for both performance and correctness reasons.
In ExactPhraseScorer, count[posIndex] is incremented by 1 each time the conditions for a match (still) hold.
A sloppy version of this, with N terms and slop=S could increment differently:
1 + N*S at posIndex
1 + N*S - 1 at posIndex-1 and posIndex+1
1 + N*S - 2 at posIndex-2 and posIndex+2
...
1 + N*S - S at posIndex-S and posIndex+S
For S=0, this falls back to only increment by 1 and only at posIndex, same as the ExactPhraseScorer, which makes sense.
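To make the increment scheme concrete, here is a minimal sketch of it as a plain function. The class and method names are made up for illustration and are not part of ExactPhraseScorer; it assumes an offset outside the slop window contributes nothing.

```java
// Hypothetical helper, not Lucene code: the increment a term found at
// posIndex would add to count[posIndex + offset].
final class SloppyIncrement {
    private SloppyIncrement() {}

    /**
     * Returns 1 + N*S - |offset| when |offset| <= S, and 0 otherwise
     * (assumption: positions outside the slop window get no increment).
     */
    static int increment(int numTerms, int slop, int offset) {
        int distance = Math.abs(offset);
        if (distance > slop) {
            return 0;
        }
        return 1 + numTerms * slop - distance;
    }
}
```

With slop 0 only offset 0 is inside the window and the increment is 1, which is the exact-phrase behavior.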
Also, the success criteria in ExactPhraseScorer, when checking term k, is, before adding up 1 for term k:
- count[posIndex] == k-1
Or, after adding up 1 for term k:
- count[posIndex] == k
In the sloppy version the criteria (after adding up term k) would be:
- count[posIndex] >= k*(1+N*S)-S
Again, for S=0 this falls back to the ExactPhraseScorer logic:
- count[posIndex] >= k
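The proposed criterion can be written down as a small check, mostly to confirm the S=0 fallback. This is only a sketch of the formula in this comment, with hypothetical names; it says nothing yet about whether the criterion is correct for all inputs.

```java
// Hypothetical helper, not Lucene code: the proposed success test after
// adding the increment for term k (1-based), with N terms and slop S.
final class SloppyMatchCriterion {
    private SloppyMatchCriterion() {}

    /** Minimum count required after term k: k*(1 + N*S) - S. */
    static int threshold(int k, int numTerms, int slop) {
        return k * (1 + numTerms * slop) - slop;
    }

    /** True if count[posIndex] still qualifies as a match after term k. */
    static boolean stillMatches(int count, int k, int numTerms, int slop) {
        return count >= threshold(k, numTerms, slop);
    }
}
```

For slop 0 the threshold is simply k, so a count of k passes and k-1 does not.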
Mike (and all), correctness wise, what do you think?
If you are wondering why the increment at the actual position is (1 + N*S) - it ensures that only posIndexes where all terms contributed something can match.
I drew an example with 5 terms and slop=2 and so far it seems correct.
Also tried 2 terms and slop=5, this seems correct as well, just that, when there is a match, several posIndexes will contribute to the score of the same match. I think this is not too bad, since for these parameters the behavior would be the same in all documents. I would be especially forgiving about this if it gets us some of the performance benefits of the ExactPhraseScorer.
If we agree on correctness, we need to understand how to implement it, and consider the performance effect. The tricky part is to increment at posIndex-n. Say there are 3 terms in the query and one of the terms is found at indexes 10, 15, 18. Assume the slop is 2. Since N=3, the max increment is:
1 + N*S = 1 + 3*2 = 7
So the increments for this term would be (pos, incr):
8 , 5
9 , 6
10 , 7
11 , 6
12 , 5
13 , 5
14 , 6
15 , 7
16 , 6 = max(6,5)
17 , 6 = max(5,6)
18 , 7
19 , 6
20 , 5
So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we do not know yet about the contribution of posIndex 18, which is 6, and should be used instead of 5. So some look-ahead (or some fix-back) is required, which will complicate the code.
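As a cross-check of the table, here is a rough sketch that computes the per-position increments for one term when all its positions are known up front, taking the max where windows overlap. That is exactly the luxury a streaming scorer does not have, so this illustrates the expected values, not a proposed implementation; the names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not Lucene code: builds the (pos, incr) table for a
// single term, assuming all of its positions are available in advance.
final class SloppyIncrementTable {
    private SloppyIncrementTable() {}

    static Map<Integer, Integer> increments(int[] termPositions, int numTerms, int slop) {
        int maxIncrement = 1 + numTerms * slop;
        Map<Integer, Integer> table = new TreeMap<>();
        for (int pos : termPositions) {
            for (int offset = -slop; offset <= slop; offset++) {
                int target = pos + offset;
                if (target < 0) {
                    continue; // positions before the start of the document do not exist
                }
                int incr = maxIncrement - Math.abs(offset);
                // Overlapping windows: keep the larger contribution.
                table.merge(target, incr, Integer::max);
            }
        }
        return table;
    }
}
```

Calling increments(new int[] {10, 15, 18}, 3, 2) gives entries for positions 8 through 20. Position 17 ends up with 6, but only because position 18 was already visible when the table was built.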
If this seems promising, I should probably open a new issue for it, just wanted to get some feedback first.