After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is one.
1) Missing Matches caused by lazy iteration:
I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I know there never was an implementation that compared all possible combinations of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed some matches.
*) This ticket demonstrates missing matches for ordered SpanQuery. Documents that should match don't match. This is caused by subspans of SpanNearQuery having a variable match length. For these cases the lazy iteration implementation which tries to optimize the number of comparisons of subspan matches is not sufficient.
*) Tim tried these examples with unorderd SpanQuery and got the same bahavior. I think this is caused by a similar kind of lazy iteration in the unordered case.
*) In the unordered case lazy iteration also causes problems if the subspans do not have variable-length matches. This is demonstrated in LUCENE-5331 and LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all clauses of the SpanNearQuery were SpanTermQueries, but some occured more than once. For PhraseQuery and MultiPhraseQuery and their implementation in SloppyPhraseScore this seems to be a known problem that has been solved by a special complex treatment of repetitions that I currently don't understand in detail.
My current opinion: We should give up lazy iteration for the unordered and the ordered case to solve these problems. I think it can be done and the performance peanalty should not be too big. We already iterate over all positions of all subspans. So we already have done the expensive operation of reading them. Should some more comparisons of int-values (positions) really matter so much? At least for the ordered case I am optimistic that I could implement it efficiently.
2) Inconsistent Scoring of SpanNearQuery
*) Lazy iteration means that some "redundant" matches in a document are skipped in order to have a faster matching algorithm. I am not sure how redundant was defined exactly for the idea of lazy iteration. It referred to matches with the same start posisiton somehow. As long as different matches for the first clause are concerned, they are found, but not the all matches for intermediate subclauses are regarded. Skipping matches however reduces the frequency that is computed and consequently the score. See Javadoc of phraseFreq() in SloppyPhraseScore which mention the same phenomenon. This is quite important for my use case of SpanQueries. I have different versions/variants of the same term on the same position, e.g. one with case-normalization and one without and I want a higher score if the user-query matches for more than one variant, and I use this approach for clauses of SpanNearQuery.
*) In NearSpansOrdered the method width() (it is used to compute sloppy frequency in SpanScore) returns the number of gaps between the matches. If you have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it returns the length of the match, not the number of gaps. See atMatch() for the difference. The reason is probably, that (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength) might even become negative if matches overlap. I would prefer something like Math.max(0, (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength))
*) SpanOrQuery and SpanNearQuery completely ignore the scores of their subclauses (subweights are always generated as non-scoring). A SpanOrQuery should give a Score similar to a BooleanQuery, shouldn't it? As long as we have this behavior, SpanBoostQuery does not make any sense, doese it? So to my opinion the existance of SpanBoostQuery shows that others also had the idea that a nested SpanQuery should somehow use the scores of their clauses for the computation of their own score.