TIKA-3880

Tika not picking up setByteArrayMaxOverride from tika-config

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Resolved
    • Affects Version/s: 2.5.0
    • Fix Version/s: 2.5.0
    • Component/s: app
    • Labels: None
    • Flags: Important

    Description

      I have specified this parser parameter in tika-config.xml:

      <properties>
        <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
          <params>
            <param name="byteArrayMaxOverride" type="int">700000000</param>
          </params>
        </parser>
      </properties>
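
      For comparison, here is a minimal sketch of how the same override might be declared. It assumes (this is not confirmed anywhere in this report) that Tika only reads parser entries nested inside a <parsers> element, so a <parser> placed directly under <properties> could be silently skipped. The DefaultParser entry with its parser-exclude is also an assumption, added so that the remaining parsers stay available:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- Assumption: parser entries are wrapped in a <parsers> element -->
  <parsers>
    <!-- Assumption: keep the default parsers, but exclude OOXMLParser
         so the customized entry below is the one that handles docx -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- Same parameter and value as in the report above -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

      If this layout applies, the value 700000000 would be above the 686,679,089 bytes requested in the exception below, so the allocation should no longer be rejected on size alone.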
       
      I've also verified that Tika picks up tika-config.xml on startup:
        org.apache.tika.server.core.TikaServerProcess Using custom config: /tika-config.xml
       
      However, when I process a very large docx file, the exception below shows that the setting from tika-config is not being applied:
       
      Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 686,679,089, but the maximum length for this record type is 100,000,000.
      If the file is not corrupt and not large, please open an issue on bugzilla to request 
      increasing the maximum allowable size for this record type.
      You can set a higher override value with IOUtils.setByteArrayMaxOverride()
       
      I understand that this is a very large docx file. However, we can handle this amount of text extraction, and we are fine with the time and memory Tika needs to complete it.

          People

            Assignee: Unassigned
            Reporter: Ethan Wilansky (ethanw)