What SpansTreeQuery does not do, and some rough edges:
The SpansDocScorer objects do the match recording and scoring, and there is one for each Spans.
These SpansDocScorer objects might be merged into their Spans to reduce the number of objects.
Related: how to deal with the same term occurring in more than one subquery? See also LUCENE-7398.
Normally the term frequency score has a diminishing contribution for extra occurrences.
In the patch, the slop factors for a term are applied in decreasing order to these diminishing contributions.
This requires sorting of the slop factors.
Sorting the slop factors could be avoided if an actual score for a single term occurrence were available.
In that case the given slop factor could be used as a weight on that score.
It might be possible to estimate an actual score for a single term occurrence
from the distances to other occurrences of the same term.
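The sorted application of slop factors described above can be illustrated with a minimal sketch.
This is not the code from the patch: the class name, the method names and the saturating
term frequency function freq / (freq + k) are assumptions chosen only to show the idea.

```java
import java.util.Arrays;

public class SortedSlopFactors {
    // Hypothetical saturating term frequency function (BM25-like shape).
    // The parameter k is an assumption for this sketch.
    static double saturatedTf(int freq, double k) {
        return freq / (freq + k);
    }

    // Applies the slop factors in decreasing order to the diminishing
    // contribution of each extra occurrence of the term.
    static double score(double[] slopFactors, double k) {
        double[] sorted = slopFactors.clone();
        Arrays.sort(sorted); // ascending order, read from the end below
        double total = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            double slopFactor = sorted[sorted.length - 1 - i]; // i-th largest
            // Contribution of occurrence i+1 on top of the first i occurrences.
            double contribution = saturatedTf(i + 1, k) - saturatedTf(i, k);
            total += slopFactor * contribution;
        }
        return total;
    }
}
```

With all slop factors equal to 1 the contributions add up to saturatedTf(n, k) for n occurrences,
so the sketch reduces to plain term frequency scoring. Because of the sort, the result does not
depend on the order in which the matches were recorded.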
Similarly, the decreasing term frequency contributions can be seen as a proximity weighting for the same term (or subquery):
the closer a term occurs to itself, the smaller its contribution.
This might be refined by using the actual distances to the other term occurrences (or subquery occurrences)
to provide a weight for each term occurrence. This is unusual because the weight decreases for smaller distances.
The slop factor from the Similarity may need to be adapted because of the way it is combined here
with diminishing term contributions.
Another use of a score of each term occurrence could be to use the absolute term position
to influence the score, possibly in combination with the field length.
There is an assert in TermSpansDocScorer.docScore() that verifies that
the smallest occurring slop factor is at least as large as the non matching slop factor.
This condition is necessary for consistency.
Instead of using this assert, this condition might be enforced by somehow
automatically determining the non matching slop factor.
This is a prototype. No profiling has been done; it will take more CPU, but I have no idea how much.
Garbage collection might be affected by the reference cycles between the SpansDocScorers
and their Spans.
Since this allows weighting of subqueries, it might be possible to implement synonym scoring
in SpanOrQuery by providing good subweights, and wrapping the whole thing in SpansTreeQuery.
The only thing that might still be needed then is a SpansDocScorer that applies the SimScorer.score()
over the total term frequency of the synonyms in a document.
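A rough sketch of what such synonym scoring might look like follows.
It does not use the Lucene API: the class and method names are hypothetical, the similarity is
passed in as a plain function of the frequency, and multiplying each synonym frequency by its
subweight before summing is an assumption about how the subweights would be used.

```java
import java.util.function.DoubleUnaryOperator;

public class SynonymFreqScore {
    // Sums the (weighted) term frequencies of all synonyms in a document
    // and applies the similarity function once over the total.
    // simScore stands in for SimScorer.score(); it is not the Lucene signature.
    static double score(int[] freqs, double[] subWeights, DoubleUnaryOperator simScore) {
        if (freqs.length != subWeights.length) {
            throw new IllegalArgumentException("one subweight per synonym is required");
        }
        double totalFreq = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            totalFreq += subWeights[i] * freqs[i];
        }
        return simScore.applyAsDouble(totalFreq);
    }
}
```

Applying the similarity once over the total keeps the diminishing contribution shared by all
synonyms, instead of letting each synonym start its own saturation curve.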
SpansTreeScorer multiplies the slop factor for nested near queries at each level.
Alternatively a minimum distance could be passed down.
This would require changing recordMatch(float slopFactor) to recordMatch(int minDistance).
Would minDistance make sense, or is there a better distance?
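The difference between the two options can be sketched as follows.
The slop factor 1 / (distance + 1) has the same shape as sloppyFreq() in ClassicSimilarity and
BM25Similarity. The class, the method names, and the reading of "minimum distance" as the smallest
distance over the nesting levels are assumptions made for this sketch only.

```java
public class NestedSlopFactor {
    // Same shape as sloppyFreq(int distance) in Lucene's default similarities.
    static float slopFactor(int distance) {
        return 1.0f / (distance + 1);
    }

    // Current approach: multiply the slop factors of all nesting levels.
    static float multiplied(int[] distancesPerLevel) {
        float factor = 1.0f;
        for (int distance : distancesPerLevel) {
            factor *= slopFactor(distance);
        }
        return factor;
    }

    // Alternative: pass a single distance down (here the minimum over the
    // levels) and convert it to a slop factor once, at the term level.
    static float fromMinDistance(int[] distancesPerLevel) {
        if (distancesPerLevel.length == 0) {
            return 1.0f; // no near query above the term, no slop penalty
        }
        int minDistance = Integer.MAX_VALUE;
        for (int distance : distancesPerLevel) {
            minDistance = Math.min(minDistance, distance);
        }
        return slopFactor(minDistance);
    }
}
```

For two nested near queries matching with distances 1 and 3, multiplied() gives 0.5 * 0.25 = 0.125,
while fromMinDistance() gives 0.5. The multiplied factor decreases with every extra nesting level,
which the single distance avoids.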
What is a good way to test whether the score values from SpansTreeQuery actually improve on
the score values from the current SpanScorer?
There are no tests for SpanFirstQuery/SpanContainingQuery/SpanWithinQuery.
These tests are not there because these queries provide FilterSpans, and that is already supported for SpanNotQuery.
The explain() method is not implemented for SpansTreeQuery.
This should be doable with an explain() method added to SpansTreeScorer to provide the explanations.
There is no support for PayloadSpanQuery.
PayloadSpanQuery is not in here because it is not in the core module.
I think it can fit in here because PayloadSpanQuery also scores per matching term occurrence.
Then Spans.doStartCurrentDoc() and Spans.doCurrentSpans() could be removed.
If this is acceptable as a good way to score Spans, then
Spans.width(), Scorer.freq() and SpansDocScorer.docMatchFreq() might be removed.
Would it make sense to implement child Scorers in the tree of SpansDocScorer objects?