Lucene - Core / LUCENE-8947

Indexing fails with "too many tokens for field" when using custom term frequencies

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 7.5
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      We are using custom term frequencies (LUCENE-7854) to index per-token scoring signals. However, for one document that had many tokens, each carrying a fairly large (~998,000) scoring signal, we hit this exception:

      2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: java.lang.IllegalArgumentException: too many tokens for field "foobar"
      at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
      at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
      at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
      at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
      at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
      at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
      at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
      

      This is happening in this code in DefaultIndexingChain.java:

        try {
          invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
        } catch (ArithmeticException ae) {
          throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
        }

      Here Lucene accumulates the total length (number of tokens) for the field in an int, so the exception is thrown once the sum of the term frequencies exceeds Integer.MAX_VALUE. But total length doesn't really make sense if you are using custom term frequencies to hold arbitrary scoring signals. Or maybe it does make sense, if the user is using this as simple boosting, in which case perhaps we should allow this length to be a long?
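
      To make the arithmetic concrete, here is a minimal standalone sketch (not Lucene code) that mimics the accumulation above with plain Java. The class and method names are hypothetical, and the per-token value of 998,000 is taken from the report; it only illustrates how quickly an int sum overflows and that a long accumulator would not.

```java
public class TermFreqLengthOverflow {

    /**
     * Adds termFreq once per token into an int, as the snippet above does.
     * Returns the 1-based index of the token whose addition overflows,
     * or -1 if no overflow occurs within maxTokens tokens.
     */
    static int tokenThatOverflowsInt(int termFreq, int maxTokens) {
        int length = 0;
        for (int token = 1; token <= maxTokens; token++) {
            try {
                // Math.addExact(int, int) throws instead of silently wrapping around.
                length = Math.addExact(length, termFreq);
            } catch (ArithmeticException ae) {
                return token;
            }
        }
        return -1;
    }

    /** Same accumulation, but with a long accumulator (the alternative the report asks about). */
    static long lengthAsLong(int termFreq, int tokens) {
        long length = 0L;
        for (int token = 0; token < tokens; token++) {
            length = Math.addExact(length, (long) termFreq);
        }
        return length;
    }

    public static void main(String[] args) {
        int termFreq = 998_000; // per-token scoring signal from the report
        int failingToken = tokenThatOverflowsInt(termFreq, 10_000);
        System.out.println("int accumulator overflows at token " + failingToken);
        System.out.println("long accumulator at that token: " + lengthAsLong(termFreq, failingToken));
        System.out.println("Integer.MAX_VALUE: " + Integer.MAX_VALUE);
    }
}
```

      With a per-token value of 998,000, the int accumulator in this sketch overflows on the 2,152nd token, so a single field with a couple of thousand such tokens is enough to trigger the exception.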

              People

              • Assignee: Unassigned
              • Reporter: Michael McCandless (mikemccand)

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 2h 40m