I think it's still useful though - I use it all the time!
Yeah, but it's slow, with no easy chance of ever being faster. There is no simple bitset rewrite here like there is for other multiterm queries. Additionally, it has all the downsides of an enormous boolean query, but with proximity to boot: and this is very real, even simple stuff like 1-2 KB of RAM consumption per term due to the additional decompression buffers for prox. Maybe in the future you could optionally index prefix terms, but I can't imagine merging proximity etc. into a prefix-field for full-indexed-fields as a default; it seems complicated, slow and space-consuming.
It would be nice if you could restrict the number of SpanOr clauses it rewrites to, but that's a separate issue.
+1, that is a great idea. We should really both do that and also add warnings to the javadocs about inefficiency. It has none today!
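To make the clause-limit idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: a `SortedSet<String>` stands in for the term dictionary, and the names `CappedPrefixExpansion`, `expand` and `maxClauses` are hypothetical. It only shows the shape of the idea, which is to stop collecting terms once the cap is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: a sorted set of strings stands in for the term dictionary.
// None of these names are Lucene classes or methods.
class CappedPrefixExpansion {

    /** Collects at most maxClauses terms starting with prefix, in dictionary order. */
    static List<String> expand(SortedSet<String> termDict, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // tailSet(prefix) begins at the first term >= prefix, so matching terms are contiguous.
        for (String term : termDict.tailSet(prefix)) {
            if (!term.startsWith(prefix) || clauses.size() >= maxClauses) {
                break; // past the prefix range, or the cap has been reached
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> dict =
            new TreeSet<>(Arrays.asList("span", "spank", "spanner", "spans", "term"));
        // Four terms match the prefix, but only two clauses are produced.
        System.out.println(expand(dict, "span", 2));
    }
}
```

A real rewrite would also have to decide which terms to keep when the cap is hit (for example the most frequent ones) rather than simply the first ones in dictionary order; the sketch leaves that choice open.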
If you really think that moving .getSpans() and .extractTerms() to SpanWeight doesn't gain anything, then I can back it out. But I think it does simplify the API and brings it more into line with our other standard queries.
I totally agree it has the value of consistency with other queries. But some of the APIs trying to do this are fairly complicated, yet at the same time still not really working: see below for more explanation.
And I really don't see that exposing the termcontexts map on the SpanWeight constructor is any worse than exposing it directly in .getSpans(). In fact, I'd say that it's hiding it better - very few users of Lucene are going to be looking at SpanWeights, as they're an implementation detail, but anyone using an IDE is going to be shown SpanQuery.getSpans() when they try to autocomplete on a SpanQuery object, and it's not something that most users need to worry about.
It's actually terrible already: the motivation for this stuff was to try to speed up the turtle in question, SpanMultiTermQuery. This stuff was exposed because it could bring some relief to such crazy queries, by visiting each term in the term dictionary fewer than 3 times (rewrite, weight/idf, postings). But this was never quite right, for two reasons:
- Leniency: We can't enforce that we are doing the performant thing, because creation of weight/idf uses extractTerms(). So the SpanTermWeight inside the exclude portion of a SpanNot suddenly sees an unexpected term it has no termstate for. Maybe patches here removed this problem but forgot to fix the leniency in SpanTermWeight, as I see that at least the code comment is gone.
- Incomplete: SpanMultiTermQueryWrapper still isn't reusing the termcontext from rewrite() and somehow passing it down to the rewritten spans. So the whole ugly thing isn't even totally working: it's just reducing the number of visits to the term dictionary from 3 down to 2, but it is stupid that it is not 1.
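The 3-versus-1 visit count above can be illustrated with a small plain-Java toy. This is not Lucene code: `CountingTermDict`, `visitsWithoutCache` and `visitsWithCache` are hypothetical names, and an `Integer` doc frequency stands in for the per-term state. The sketch only shows why carrying the state found during rewrite through to the later phases saves the repeated seeks.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names are Lucene classes or methods.
class TermLookupCounting {

    /** Stand-in for a term dictionary that counts how often it is consulted. */
    static class CountingTermDict {
        private final Map<String, Integer> docFreqs = new HashMap<>();
        int lookups = 0;

        void add(String term, int docFreq) {
            docFreqs.put(term, docFreq);
        }

        Integer lookup(String term) {
            lookups++; // every call models one seek in the term dictionary
            return docFreqs.get(term);
        }
    }

    /** Each phase seeks the term again, so the dictionary is visited three times. */
    static int visitsWithoutCache(CountingTermDict dict, String term) {
        dict.lookups = 0;
        dict.lookup(term); // rewrite
        dict.lookup(term); // weight/idf
        dict.lookup(term); // postings
        return dict.lookups;
    }

    /** The state found during rewrite is kept and reused by the later phases. */
    static int visitsWithCache(CountingTermDict dict, String term) {
        dict.lookups = 0;
        Map<String, Integer> termStates = new HashMap<>();
        termStates.put(term, dict.lookup(term)); // rewrite: the only seek
        for (int phase = 0; phase < 2; phase++) { // weight/idf, then postings
            if (!termStates.containsKey(term)) {
                dict.lookup(term); // fallback seek, only if no state was passed down
            }
        }
        return dict.lookups;
    }

    public static void main(String[] args) {
        CountingTermDict dict = new CountingTermDict();
        dict.add("lucene", 42);
        System.out.println("without cache: " + visitsWithoutCache(dict, "lucene"));
        System.out.println("with cache:    " + visitsWithCache(dict, "lucene"));
    }
}
```

The fallback seek in `visitsWithCache` is the toy equivalent of the leniency described in the first bullet: as long as a missing state silently triggers another lookup, nothing forces callers onto the cheap path.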