Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
8.3
-
None
-
None
-
New
Description
In [1] a CommonTermsQuery is used in order to perform a query with lots of (duplicate) terms. Using a max term frequency cutoff of 0.999 for low frequency terms, the query, although big, finishes in around 2-300ms with Lucene 7.6.0.
However, when upgrading the code to Lucene 8.x, the query runs in 2-3s instead [2].
After digging a bit into it it seems that the regression in speed comes from the fact that top-k scoring introduced by default in version 8 is causing that, not sure "where" exactly in the code though.
When switching back to complete hit scoring [3], the speed goes back to the initial 2-300ms also in Lucene 8.3.x.
It'd be nice to understand the reason why this is happening and if it is only concerning CommonTermsQuery or affecting BooleanQuery as well.
If this is a case that depends on the data and application involved (Anserini in this case), the application should handle it, otherwise if it is a regression/bug in Lucene it'd be nice to fix it.
[1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
[2] : https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
[3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174