Solr / SOLR-7058

Data-driven schema needs to index large text fields as text and not as string


    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Data-driven Schema
    • Labels:


      While using the SimplePostTool to index some Freebase articles into a core that uses our data-driven configs, I ran into the following gem:

      Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="xml_data" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46, 48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34]...', original message: bytes can be at most 32766 in length; got 173684
      	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
      	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
      	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
      	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
      	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
      	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1415)
      	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
      	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
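The byte prefix in the exception message is easier to read once decoded. The following is a minimal sketch (the class name `ImmenseTermPrefix` is hypothetical and not part of Solr or Lucene) that copies the 30 byte values from the message and decodes them as UTF-8. It suggests that the whole XML payload of the field was handed to the indexer as one term.

```java
import java.nio.charset.StandardCharsets;

public class ImmenseTermPrefix {
    // Byte values copied verbatim from the exception message above.
    static final int[] PREFIX = {
        60, 63, 120, 109, 108, 32, 118, 101, 114, 115,
        105, 111, 110, 61, 34, 49, 46, 48, 34, 32,
        101, 110, 99, 111, 100, 105, 110, 103, 61, 34
    };

    // Convert the logged integer values back into a UTF-8 string.
    static String decode(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints: <?xml version="1.0" encoding="
        System.out.println(decode(PREFIX));
    }
}
```

The decoded prefix is the start of an XML declaration, which is consistent with the field name `xml_data` and with the reported term length of 173684 bytes.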

      Ideally, the data-driven configs would index large text fields containing multiple whitespace-delimited tokens as text and not as a string, since a string field indexes the whole value as a single term and so runs into the 32766-byte term limit shown in the exception. However, this poses an issue if the first doc has a short text value that looks like a string, so the field is created as a string, and then the next doc has a large one. Not sure what the right solution looks like yet, but wanted to capture the issue so we can discuss options.
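To make the trade-off concrete, here is a rough sketch of a per-value heuristic. It is not how the data-driven configs work today and is not a proposed patch. The class `FieldTypeGuess`, its method names, and the token threshold are all hypothetical. Only the 32766-byte limit comes from the exception message.

```java
import java.nio.charset.StandardCharsets;

public class FieldTypeGuess {
    // Maximum UTF-8 length of a single term, as reported in the exception.
    static final int MAX_TERM_BYTES = 32766;
    // Assumption for illustration only: more tokens than this looks like prose.
    static final int MAX_STRING_TOKENS = 3;

    // Guess "text" or "string" from a single value (hypothetical heuristic).
    static String guess(String value) {
        int utf8Length = value.getBytes(StandardCharsets.UTF_8).length;
        if (utf8Length > MAX_TERM_BYTES) {
            return "text"; // cannot be indexed as one term at all
        }
        String trimmed = value.trim();
        int tokens = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return tokens > MAX_STRING_TOKENS ? "text" : "string";
    }

    // Small helper so the example also runs on older Java versions.
    static String repeat(String s, int times) {
        StringBuilder sb = new StringBuilder(s.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String firstDoc = "Apache Solr";            // short, looks like a string
        String secondDoc = repeat("word ", 40000);  // 200000 bytes of text
        System.out.println(guess(firstDoc));   // string
        System.out.println(guess(secondDoc));  // text
    }
}
```

The two guesses disagree for the same field, which is exactly the ordering problem described above: a decision made on the first doc can be wrong for the second.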


              • Assignee: Timothy Potter (thelabdude)


                • Created: