Details
- Type: Improvement
- Status: Closed
- Priority: Blocker
- Resolution: Resolved
- Affects Version/s: 2.5.0
- Fix Version/s: None
Environment
We are running this through Docker on a machine with plenty of memory allocated to Docker.
- Docker config: 32 GB, 8 processors
- Host machine: 64 GB, 32 processors
Our docker-compose configuration is derived from: https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml
We are experienced with Docker and are confident that the issue isn't with Docker.
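For illustration, a stripped-down compose file along the lines of what we run (the image tag, mount path, and `--config` argument here are representative, not our exact file; the mount path matches the config path shown in the startup log):

```yaml
version: "3"
services:
  tika:
    image: apache/tika:2.5.0-full
    ports:
      - "9998:9998"
    volumes:
      # mount the custom config at the container root
      - ./tika-config.xml:/tika-config.xml
    # pass the config path through to tika-server
    command: --config /tika-config.xml
```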
Description
I have specified this parser parameter in tika-config.xml:
<properties>
  <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
    <params>
      <param name="byteArrayMaxOverride" type="int">700000000</param>
    </params>
  </parser>
</properties>
I've also verified that the tika-config.xml is picked up by Tika on startup:
org.apache.tika.server.core.TikaServerProcess Using custom config: /tika-config.xml
However, when I encounter a very large docx file, I can clearly see that the configuration in tika-config.xml is not being applied:
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 686,679,089, but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
I understand that this is a very large docx file. However, we can handle this volume of text extraction, and we are fine with both the time Tika takes to perform it and the memory required to complete it.
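For what it's worth, the error message points at `IOUtils.setByteArrayMaxOverride()`, which raises the same cap programmatically. A minimal sketch of that workaround, assuming Apache POI is on the classpath (the class name and wiring are illustrative, not part of our setup):

```java
import org.apache.poi.util.IOUtils;

public class ByteArrayOverrideWorkaround {
    public static void main(String[] args) {
        // Same value as the byteArrayMaxOverride param in tika-config.xml.
        // Unlike the per-parser config param, this raises POI's
        // record-allocation cap globally for the whole JVM, so it must run
        // before any POI-backed parsing starts.
        IOUtils.setByteArrayMaxOverride(700_000_000);
    }
}
```

This is only a stopgap; the config-file route above is preferable because it scopes the override to the OOXMLParser rather than every POI consumer in the process.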