The pain of not using any review tool (no matter how imperfect any might be) is quite apparent to me with these patch updates and multi-user interaction. Let the pain continue? Rob; you could ignore the RB if so long as if the patches are here for you to do the perf testing. I don't mean to suggest we should use a review tool for every JIRA issue.
SpanNear: when the subSpans are initialized and one of them is null, null is returned for the NearSpans... . I think SpanNear should have at least two subqueries, I did not expect that to break backward compatibility. The single subquery case could be restored, but is it really needed? When this happens the caller should be aware that the slop is going to be ignored, and in the ordered case there should be no match at all because the subSpans should not overlap then.
I simply recognized it as a difference and wondered why. If some thing "should" not happen (you claim SpanNear should have at least 2 sub-queries) then an assertion or comment to that effect would be good. As to the particulars of this case, I'm not sure at the moment without further inspection.
SpanNotQuery and SpanScorer: as I understood there is no need for these two implement TwoPhaseIterator because that is only needed when there is some form of conjunction. SpanScorer only takes a single Spans, and for SpanNot advance() on the excluded spans only is good enough.
I believe it's useful for any aggregation, even over one so that the TPI benefits percolate across parent/child queries in aggregate. Adrien Grand please check me on this. Having said this I don't know why TPIs aren't propagated more generally; maybe a TODO?.
On to my example: If any Query that wraps another Query (technically a Scorer wrapping another Scorer, or Span wrapping another Span here) fails to propagate the TwoPhaseIterator, then the benefit of the TPI is only localized to the nodes that directly support it (e.g. BooleanQuery & SpanNearQuery). Consider the case of a top level BooleanQuery wrapping two MUST clauses, one clause with a medium-frequency term, and the other clause being a SpanNearQuery between some other SpanQuery derivatives which include a low frequency term. With your patch's SpanScorer not supporting TPI, the top level BQ won't see a TPI so it will drive the query by the low-cost SpanQuery which will force it to visit positions for documents that could have been filtered by the top level term query's document list.
Removing Spans/intervals completely is not an option: with (nested) proximity queries there is no way to avoid iterating over intervals/spans. We can make the scoring consistent between Spans and PhraseQuery, and PhraseQuery with getSpans() would also be good for performance of common cases.
I could have been clearer. I simply mean I look forward to the dichotomy being gone or made easier. That is, today, a SpanNearQuery only accepts SpanQueries, which forces us to convert other queries to them (even TermQuery), which is a pain. Any way, that discussion should be left to LUCENE-2878 so I shouldn't have brought it up here.