Lucene - Core / LUCENE-9635

BM25FQuery - MultiNormsLeafSimScorer needs to mask long value for long documents


    Details

      Description

      While experimenting with BM25FQuery on long documents, I found that MultiNormsLeafSimScorer does not mask the encoded norm's long value during scoring. For long documents (or long fields) this can cause an ArrayIndexOutOfBoundsException.
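      As a minimal sketch of why the mask matters, assume the encoded norm is a byte widened to a long and then used as an index into a 256-entry table. The class name, the LENGTH_TABLE array and the helper methods below are hypothetical and are not taken from the Lucene source; they only illustrate the sign-extension problem.

```java
public class NormMaskDemo {
    // Hypothetical stand-in for a 256-entry table of decoded field lengths.
    static final float[] LENGTH_TABLE = new float[256];

    // Without masking: the byte cast sign-extends, so values 128..255 become negative.
    public static int unmaskedIndex(long encodedNorm) {
        return (byte) encodedNorm;
    }

    // With masking: the low 8 bits are read as an unsigned value in 0..255.
    public static int maskedIndex(long encodedNorm) {
        return ((byte) encodedNorm) & 0xFF;
    }

    // Returns true if reading LENGTH_TABLE[index] throws ArrayIndexOutOfBoundsException.
    public static boolean lookupThrows(int index) {
        try {
            float unused = LENGTH_TABLE[index];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long encodedNorm = (byte) 128; // stored as -128 once widened to long
        System.out.println("unmasked index: " + unmaskedIndex(encodedNorm)); // -128
        System.out.println("masked index:   " + maskedIndex(encodedNorm));   // 128
        System.out.println("unmasked throws: " + lookupThrows(unmaskedIndex(encodedNorm))); // true
        System.out.println("masked throws:   " + lookupThrows(maskedIndex(encodedNorm)));   // false
    }
}
```

      Byte.toUnsignedInt((byte) encodedNorm) is an equivalent way to write the masked form.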

      I suspect the bug is exposed at this line:
      https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java#L131

      For comparison, BM25Similarity has a similar lookup that does apply the mask:
      https://github.com/apache/lucene-solr/blob/c413656b627160d49eb9e9f1f84ec4945db80f0e/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L233

      My experiments show that exposing this bug requires two conditions: a token must match in more than one field (the case BM25FQuery is designed for), and one of the fields must be >= 32792 tokens long.
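      The 32792 threshold can be checked with a standalone sketch, assuming field lengths are encoded to a single byte the way I understand Lucene's SmallFloat.intToByte4 to work. The encoder below is a re-implementation written for illustration, not the library code, so treat it as an assumption. Under that assumption, 32792 is the first length whose encoded byte is negative.

```java
public class NormEncodingDemo {
    // Assumed re-implementation of a 4-bit-mantissa encoding of a non-negative long.
    public static int longToInt4(long i) {
        if (i < 0) {
            throw new IllegalArgumentException("Only non-negative values are supported");
        }
        int numBits = 64 - Long.numberOfLeadingZeros(i);
        if (numBits < 4) {
            return (int) i; // subnormal: small values are stored exactly
        }
        int shift = numBits - 4;
        int encoded = (int) (i >>> shift); // keep the 4 most significant bits
        encoded &= 0x07;                   // drop the implicit leading bit
        encoded |= (shift + 1) << 3;       // store the shift; 0 is reserved for subnormals
        return encoded;
    }

    // Number of small lengths that are stored exactly (24 under this encoding).
    static final int NUM_FREE_VALUES = 255 - longToInt4(Integer.MAX_VALUE);

    // Assumed re-implementation of the length-to-byte encoding.
    public static byte intToByte4(int i) {
        if (i < 0) {
            throw new IllegalArgumentException("Only non-negative values are supported");
        }
        if (i < NUM_FREE_VALUES) {
            return (byte) i;
        }
        return (byte) (NUM_FREE_VALUES + longToInt4(i - NUM_FREE_VALUES));
    }

    public static void main(String[] args) {
        for (int length : new int[] {32791, 32792}) {
            byte norm = intToByte4(length);
            System.out.println(length + " -> byte " + norm + ", masked " + (norm & 0xFF));
        }
        // prints:
        // 32791 -> byte 127, masked 127
        // 32792 -> byte -128, masked 128
    }
}
```

      A field of 32791 tokens still encodes to a non-negative byte, so an unmasked index works by accident. One more token pushes the encoded value to 128, which a plain byte cast turns into -128.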

      The pull request includes tests that demonstrate this: https://github.com/apache/lucene-solr/pull/2138


              People

              • Assignee: Unassigned
              • Reporter: Yilun Cui (yiluncui)


                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 1h