[LUCENE-8947] Indexing fails with "too many tokens for field" when using custom term frequencies - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 7.5
Fix Version/s: None
Component/s: None
Labels: None

Lucene Fields: New

Description

We are using custom term frequencies (LUCENE-7854) to index per-token scoring signals. However, for one document that had many tokens, each carrying a fairly large (~998,000) scoring signal, we hit this exception:

2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: java.lang.IllegalArgumentException: too many tokens for field "foobar"
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)

This is happening in this code in DefaultIndexingChain.java:

  try {
    invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
  } catch (ArithmeticException ae) {
    throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
  }

Here Lucene is accumulating the total length (number of tokens) for the field as an int. But does total length really make sense if you are using custom term frequencies to hold arbitrary scoring signals? Or maybe it does make sense if the user is using this for simple boosting, in which case maybe we should allow this length to be a long?
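
To make the arithmetic concrete, here is a minimal standalone sketch of the same accumulation pattern. It is not Lucene code: the class and method names are hypothetical, and it assumes every token carries the same term frequency of 998,000, which is a simplification of the document described above. It shows roughly how few tokens are needed before an int running total overflows, and that the same sum fits comfortably in a long.

```java
// Hypothetical sketch, not Lucene code. It mirrors the Math.addExact
// accumulation quoted above, using a fixed term frequency per token.
final class FieldLengthOverflowSketch {

  /** Returns the 1-based index of the token whose frequency overflows an int running total. */
  static int tokensUntilIntOverflow(int termFreqPerToken) {
    if (termFreqPerToken <= 0) {
      throw new IllegalArgumentException("termFreqPerToken must be positive");
    }
    int length = 0;
    int tokens = 0;
    while (true) {
      tokens++;
      try {
        // Same call as in the quoted snippet: throws instead of wrapping around.
        length = Math.addExact(length, termFreqPerToken);
      } catch (ArithmeticException ae) {
        return tokens;
      }
    }
  }

  /** The same accumulation with a long total, as the description suggests might be allowed. */
  static long totalAsLong(int termFreqPerToken, int tokenCount) {
    long length = 0L;
    for (int i = 0; i < tokenCount; i++) {
      length = Math.addExact(length, (long) termFreqPerToken);
    }
    return length;
  }

  public static void main(String[] args) {
    int freq = 998_000; // assumed per-token scoring signal, taken from the report
    int overflowAt = tokensUntilIntOverflow(freq);
    // prints "int total overflows at token 2152"
    System.out.println("int total overflows at token " + overflowAt);
    // prints "long total after that token: 2147696000"
    System.out.println("long total after that token: " + totalAsLong(freq, overflowAt));
  }
}
```

With signals of this size, a little over two thousand tokens in one field is enough to exceed Integer.MAX_VALUE (2,147,483,647). Whether the rest of the indexing chain could consume a long length is not something this sketch addresses.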

Issue Links

links to GitHub Pull Request #2080

People

Assignee: Unassigned

Reporter: Michael McCandless

Votes: 0

Watchers: 3

Dates

Created: 06/Aug/19 16:57

Updated: 28/Aug/22 15:49

Resolved: 16/Jan/21 12:14

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 40m