LUCENE-6562 to share ranges across segments. To guard against OOM exceptions, the ranges need to be purged once all segments have been visited; see GeoPointTermQuery line 87 for this check.
OK, it looks like this patch also absorbed the latest patch on
LUCENE-6562? It's clever how you clear ranges after the last segment!
Unfortunately I think it's unsafe ... e.g. if IndexSearcher is using
an executor, multiple threads can call getTermsEnum I think ... and
also, in the cached case, on opening a new NRT reader, I think the
query would only see e.g. the 1 or 2 new segments.
Net/net I'm worried about the complexity of getting this working
correctly; I think for now it's best/simpler to just re-compute the ranges per segment.
We should anyway try to keep the issue separate from this one...
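The thread-safety and NRT concerns above can be sidestepped by never clearing the shared state at all: compute the ranges once, lazily, and keep them immutable for the lifetime of the rewritten query. Here is a minimal sketch of that idea; the names (RangeHolder, computeRanges) are hypothetical and not Lucene APIs.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: instead of clearing a shared range list after the "last" segment
// (unsafe when an executor has multiple threads in getTermsEnum, or when a
// reopened NRT reader only visits the 1-2 new segments), compute the ranges
// once and publish them immutably.
public class RangeHolder {
    private final AtomicReference<List<long[]>> cached = new AtomicReference<>();

    // Thread-safe lazy init: every segment, on every thread, sees the same list.
    public List<long[]> getRanges() {
        List<long[]> ranges = cached.get();
        if (ranges == null) {
            ranges = computeRanges();
            if (!cached.compareAndSet(null, ranges)) {
                ranges = cached.get(); // another thread won the race; reuse its result
            }
        }
        return ranges;
    }

    protected List<long[]> computeRanges() {
        // placeholder for the real GeoPoint range computation
        return List.of(new long[] {0L, 10L}, new long[] {20L, 30L});
    }
}
```

This trades the OOM guard for predictability: memory is bounded by one range set per live query, released when the query itself is garbage collected.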
Can we remove the separate bbox from GeoPointInPolygonQuery's ctor?
Yes! I believe so. The intent was to improve performance for detailed polygons (those with many vertices) by letting callers pass a precomputed or cached bounding box rather than having the query recompute it. It might be worth benchmarking to be certain before we take away this utility. It can certainly be misused; maybe good documentation could raise awareness?
But that cost will be minuscule compared to the overall query execution time, right?
E.g. to filter just a single "borderline" hit, we also must walk the full polygon? Multiplied by the number of borderline hits ...
It just strikes me as such a trivial optimization that it's not worth polluting the API over ... APIs are hard enough as it is. And e.g. I've already screwed it up at least once, when I passed an invalid bbox before (while working on the test).
Large ranges were previously discouraged because it took on the order of ~2 seconds to correctly compute the ranges for a large search area. Since this happened for every segment, a large search over a large index was quite time consuming. Reusing ranges across segments has brought this down to the cost of a single range computation.
OK that's a good improvement but I think large ranges are still very dangerous for this query? E.g. the query will suddenly take ... 10s of MBs heap to hold its ranges as well? Maybe (separately!) we could improve how we store the ranges in RAM, e.g. use two PrefixCodedTerms or something.
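To make the RAM idea concrete: rather than holding each range bound as a boxed long in a list, the sorted bounds can be delta-encoded into variable-length bytes, which is the same principle PrefixCodedTerms applies to sorted terms. The sketch below is illustrative only, not the actual Lucene encoding.

```java
import java.io.ByteArrayOutputStream;

// Sketch: delta + vLong encoding of sorted range bounds. Sorted input means
// every delta is non-negative, and nearby geo ranges produce small deltas
// that fit in one or two bytes instead of eight.
public class DeltaEncodedRanges {
    public static byte[] encode(long[] sortedBounds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long v : sortedBounds) {
            long delta = v - prev;
            prev = v;
            // vLong: 7 payload bits per byte, high bit marks continuation
            while ((delta & ~0x7FL) != 0) {
                out.write((int) ((delta & 0x7F) | 0x80));
                delta >>>= 7;
            }
            out.write((int) delta);
        }
        return out.toByteArray();
    }

    public static long[] decode(byte[] bytes, int count) {
        long[] result = new long[count];
        long prev = 0;
        int pos = 0;
        for (int i = 0; i < count; i++) {
            long delta = 0;
            int shift = 0;
            int b;
            do {
                b = bytes[pos++] & 0xFF;
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            result[i] = prev;
        }
        return result;
    }
}
```

Whether the decode-on-the-fly cost is acceptable during term enumeration is exactly the kind of thing a luceneutil benchmark would need to settle.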
I think, like AutomatonQuery, we need to place a default limit on the number of ranges? User can increase it if they know what they are doing...
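A guard analogous to AutomatonQuery's determinized-state limit could fail fast once the enumerated range count exceeds a configurable default, so a huge search region cannot silently allocate tens of MBs. A minimal sketch, where DEFAULT_MAX_RANGES and checkRangeCount are hypothetical names:

```java
// Sketch: hard cap on the number of ranges a query may enumerate, with an
// escape hatch for users who know what they are doing (pass a larger limit).
public class RangeLimit {
    // Hypothetical default; the right value would come from benchmarking.
    public static final int DEFAULT_MAX_RANGES = 100_000;

    public static void checkRangeCount(int numRanges, int maxRanges) {
        if (numRanges > maxRanges) {
            throw new IllegalArgumentException(
                "Query produced " + numRanges + " ranges, more than the allowed "
                + maxRanges + "; narrow the search area or raise the limit");
        }
    }
}
```

Like maxDeterminizedStates in AutomatonQuery, the limit would be a constructor argument with the default applied by the convenience constructors.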
I put this nocommit in my last patch:
// nocommit why on earth are we needing to pass lat1/lon1 here the query doesn't get these...
Any idea how to fix? maxLat/maxLon cannot possibly matter here: they are randomly generated and not used by the query, yet they are used by radiusQueryCanBeWrong. This must be a test bug?