Lucene - Core / LUCENE-9107

CommonTermsQuery with a huge number of terms is slower with top-k scoring

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.3
    • Fix Version/s: None
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

      Description

      In [1], a CommonTermsQuery is used to run a query with a large number of (duplicate) terms. With a maximum term frequency cutoff of 0.999 for low-frequency terms, the query, although big, finishes in around 200-300 ms with Lucene 7.6.0.
      However, after upgrading the code to Lucene 8.x, the same query takes 2-3 s instead [2].
      After some digging, the slowdown appears to come from top-k scoring, which is enabled by default in version 8, though I have not yet found where exactly in the code it originates.
      When switching back to complete hit scoring [3], the query time returns to the initial 200-300 ms on Lucene 8.3.x as well.
      It would be good to understand why this happens, and whether it only concerns CommonTermsQuery or affects BooleanQuery as well.
      If this depends on the data and application involved (Anserini in this case), the application should handle it; otherwise, if it is a regression or bug in Lucene, it should be fixed.

      [1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
      [2] : https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
      [3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174
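
      For readers unfamiliar with the two modes contrasted above, the following is a toy model in plain Java of "complete hit scoring" versus threshold-limited top-k collection. It is not Lucene code and does not reproduce or explain the regression. The class `TopKSketch`, the method `search`, and the `upperBounds` array are hypothetical names introduced for illustration; `totalHitsThreshold` borrows the name of the parameter of Lucene 8's `TopScoreDocCollector.create(numHits, totalHitsThreshold)`. The sketch assumes each document comes with a cheap upper bound on its score.

```java
import java.util.PriorityQueue;

/** Toy model only: contrasts scoring every hit with threshold-limited top-k collection. */
class TopKSketch {

    /** Outcome of one toy search. */
    static final class Result {
        final int scored;    // number of hits whose exact score was computed
        final double[] topK; // best scores found, highest first

        Result(int scored, double[] topK) {
            this.scored = scored;
            this.topK = topK;
        }
    }

    /**
     * upperBounds[doc] is an assumed cheap upper bound on exactScores[doc].
     * Once totalHitsThreshold hits have been scored and the heap holds k entries,
     * documents whose upper bound cannot beat the weakest kept score are skipped.
     */
    static Result search(double[] upperBounds, double[] exactScores,
                         int k, int totalHitsThreshold) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Min-heap: the head is the weakest score currently kept in the top k.
        PriorityQueue<Double> heap = new PriorityQueue<>();
        int scored = 0;
        for (int doc = 0; doc < exactScores.length; doc++) {
            boolean maySkip = scored >= totalHitsThreshold && heap.size() == k;
            if (maySkip && upperBounds[doc] <= heap.peek()) {
                continue; // non-competitive: cannot enter the top k
            }
            double score = exactScores[doc]; // stands in for the expensive scoring step
            scored++;
            if (heap.size() < k) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();
                heap.add(score);
            }
        }
        // Drain the min-heap from the back so the result is sorted highest first.
        double[] topK = new double[heap.size()];
        for (int i = topK.length - 1; i >= 0; i--) {
            topK[i] = heap.poll();
        }
        return new Result(scored, topK);
    }

    public static void main(String[] args) {
        double[] scores = {1, 5, 3, 2, 4, 0.5}; // made-up scores, used as their own upper bounds
        Result complete = search(scores, scores, 2, Integer.MAX_VALUE);
        Result topK = search(scores, scores, 2, 0);
        System.out.println("complete hit scoring scored " + complete.scored + " hits");
        System.out.println("top-k scoring scored " + topK.scored + " hits");
    }
}
```

      With these six made-up scores, a threshold of `Integer.MAX_VALUE` scores all 6 hits, while a threshold of 0 scores only 4 and returns the same top 2. In Lucene 8 itself, the analogous switch is the `totalHitsThreshold` argument of `TopScoreDocCollector.create`, where `Integer.MAX_VALUE` requests exact hit counting.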

        Attachments

        1. Screenshot 2020-08-07 at 16.20.05.png
          1.39 MB
          Vincenzo D'Amore
        2. Screenshot 2020-08-07 at 16.20.01.png
          1.40 MB
          Vincenzo D'Amore
        3. image-2020-08-07-16-54-27-905.png
          1.40 MB
          Vincenzo D'Amore

            People

            • Assignee: Unassigned
            • Reporter: Tommaso Teofili (teofili)
            • Votes: 0
            • Watchers: 5
