  1. Solr
  2. SOLR-7058

Data-driven schema needs to index large text fields as text and not as string

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Data-driven Schema
    • Labels:
      None

      Description

      While using the SimplePostTool to index some Freebase articles into a core that uses our data-driven configs, I ran into the following gem:

      Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="xml_data" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46, 48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34]...', original message: bytes can be at most 32766 in length; got 173684
      	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
      	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
      	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
      	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
      	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
      	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1415)
      	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
      	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
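
The byte prefix quoted in the exception message can be decoded to see what kind of value was rejected. The following is a minimal sketch, not part of the original report; the class name `ImmenseTermPrefix` is made up for illustration, and the byte values are copied from the message above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
class ImmenseTermPrefix {
    // Byte values copied verbatim from the "prefix of the first immense term"
    // in the exception message.
    static final int[] PREFIX = {
        60, 63, 120, 109, 108, 32, 118, 101, 114, 115,
        105, 111, 110, 61, 34, 49, 46, 48, 34, 32,
        101, 110, 99, 111, 100, 105, 110, 103, 61, 34
    };

    // Convert the integer byte values into a UTF-8 string.
    static String decode(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints: <?xml version="1.0" encoding="
        System.out.println(decode(PREFIX));
    }
}
```

The decoded prefix is the start of an XML declaration, which suggests the whole XML payload (173684 bytes, per the message) reached the index as a single term in the `xml_data` field.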
      

      Ideally, the data-driven configs would index large text fields containing multiple whitespace-delimited tokens as text and not as a string. However, this poses a problem if the first doc has a short text value that looks like a string and the next doc has a large one. I'm not sure what the right solution looks like yet, but I wanted to capture the issue so we can discuss options.
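
To make the ordering problem concrete, here is a rough sketch of a per-value guess. It is purely illustrative and is not how Solr's data-driven schema works: the class `FieldTypeGuess`, its methods, and the `"text"`/`"string"` labels are all hypothetical. The only figure taken from the report is the 32766-byte limit quoted in the exception.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names come from Solr.
class FieldTypeGuess {
    // Maximum UTF-8 length of a single term, as quoted in the exception message.
    static final int MAX_TERM_BYTES = 32766;

    // True if the whole value could be indexed as one un-tokenized term.
    static boolean fitsAsSingleTerm(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES;
    }

    // Guess "text" for multi-token or oversized values, "string" otherwise.
    static String guessType(String value) {
        boolean multiToken = value.trim().split("\\s+").length > 1;
        return (multiToken || !fitsAsSingleTerm(value)) ? "text" : "string";
    }

    // Build a large whitespace-delimited value for the demonstration.
    static String largeValue(int words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words; i++) {
            sb.append("word ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String firstDoc = "SOLR-7058";          // short, looks like a string
        String secondDoc = largeValue(40_000);  // 200000 bytes, many tokens
        System.out.println(guessType(firstDoc));
        System.out.println(guessType(secondDoc));
    }
}
```

The two values produce different guesses, which is exactly the difficulty: a type chosen from the first document alone would be wrong for the second.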

              People

              • Assignee: Unassigned
              • Reporter: Timothy Potter (thelabdude)
              • Votes: 0
              • Watchers: 3
