  1. Lucene - Core
  2. LUCENE-9107

CommonTermsQuery with huge no. of terms slower with top-k scoring

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.3
    • Fix Version/s: None
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

    Description

      In [1] a CommonTermsQuery is used to run a query with lots of (duplicate) terms. Using a max term frequency cutoff of 0.999 for low frequency terms, the query, although big, finishes in around 200-300 ms with Lucene 7.6.0.
      However, after upgrading the code to Lucene 8.x, the query takes 2-3 s instead [2].
      After digging into it a bit, the regression seems to be caused by top-k scoring, which is enabled by default in version 8; I'm not sure where exactly in the code, though.
      When switching back to complete hit scoring [3], the query time goes back to the initial 200-300 ms on Lucene 8.3.x as well.
      It'd be nice to understand why this is happening, and whether it only concerns CommonTermsQuery or affects BooleanQuery as well.
      If this depends on the data and application involved (Anserini in this case), the application should handle it; otherwise, if it is a regression/bug in Lucene, it'd be nice to fix it.
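
      For reference, here is a minimal sketch of the two scoring modes compared above, assuming Lucene 8.x with lucene-core and lucene-queries on the classpath. The class name, the helper methods, the field name and the tiny in-memory index are hypothetical and only serve the illustration; this is not the code from [1], [2] or [3]. It relies on TopScoreDocCollector.create(numHits, totalHitsThreshold): a threshold of Integer.MAX_VALUE forces every hit to be counted and scored, while 1000 is, as far as I know, the default that IndexSearcher.search(Query, int) uses in Lucene 8.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.CommonTermsQuery;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TopKScoringSketch {

    // Hypothetical helper: adds one clause per word, duplicates included,
    // mirroring the "lots of (duplicate) terms" setup from the description.
    static Query commonTermsQuery(String field, float maxTermFrequency, String... words) {
        CommonTermsQuery query = new CommonTermsQuery(Occur.SHOULD, Occur.SHOULD, maxTermFrequency);
        for (String word : words) {
            query.add(new Term(field, word));
        }
        return query;
    }

    // completeHitScoring == true  -> count and score every matching document.
    // completeHitScoring == false -> allow Lucene 8 to skip non-competitive hits
    //                                once 1000 hits have been counted (assumed default).
    static TopDocs search(IndexSearcher searcher, Query query, int k, boolean completeHitScoring)
            throws IOException {
        int totalHitsThreshold = completeHitScoring ? Integer.MAX_VALUE : 1000;
        TopScoreDocCollector collector = TopScoreDocCollector.create(k, totalHitsThreshold);
        searcher.search(query, collector);
        return collector.topDocs();
    }

    // Hypothetical in-memory index so the sketch runs without external data.
    static IndexSearcher tinyIndex(String field, String... texts) throws IOException {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String text : texts) {
                Document doc = new Document();
                doc.add(new TextField(field, text, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
        return new IndexSearcher(DirectoryReader.open(dir));
    }

    public static void main(String[] args) throws IOException {
        IndexSearcher searcher = tinyIndex("body", "alpha beta", "alpha gamma", "delta");
        Query query = commonTermsQuery("body", 0.999f, "alpha", "alpha", "beta");

        for (boolean complete : new boolean[] {false, true}) {
            long start = System.nanoTime();
            TopDocs hits = search(searcher, query, 10, complete);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println((complete ? "complete" : "top-k") + " scoring: "
                    + hits.totalHits + " in " + micros + " us");
        }
    }
}
```

      On an index this small both modes do the same work, so the timings printed here say nothing about the regression; the point is only to show where the switch sits. Reproducing the slowdown would need the real query and index from [2].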

      [1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
      [2] : https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
      [3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174

      Attachments

        1. image-2020-08-07-16-54-27-905.png
          1.40 MB
          Vincenzo D'Amore
        2. Screenshot 2020-08-07 at 16.20.01.png
          1.40 MB
          Vincenzo D'Amore
        3. Screenshot 2020-08-07 at 16.20.05.png
          1.39 MB
          Vincenzo D'Amore

          People

            Assignee: Unassigned
            Reporter: Tommaso Teofili (teofili)
            Votes: 0
            Watchers: 5
