[TIKA-3519] Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached - ASF JIRA

Details

Type: Wish
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.25, 1.26
Fix Version/s: None
Component/s: detector
Labels: None
Environment: Linux

Description

We use org.apache.tika.parser.AutoDetectParser to get the metadata and body content of MS Office files. We encountered the following exception with some files:

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 14523048, but 5000000 is the maximum for this record type. If the file is not corrupt, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

To resolve the problem, we set byteArrayMaxOverride in the tika-config.xml file as follows:

</params>

</parser>

This allowed some previously failing files to parse, but others still failed, so we increased the value to 200 MB and then to 500 MB.

Other files may still fail even with byteArrayMaxOverride set to 500 MB. So we wonder whether you could add a feature to the Tika parser that makes it stop reading metadata and body content once a certain amount of memory has been used or a certain amount of body content has been read. The parser would return the metadata and body content obtained so far, along with a warning message to the caller that the limit was reached.

This would let us get metadata and body content from files that require a lot of memory. Without it, we may not be able to parse some files at all: once byteArrayMaxOverride is set high enough to avoid the failure above, those files fail somewhere else with an out-of-memory error. With this feature we would get truncated body content for some files, which is better than getting nothing. We already truncate the body content ourselves when it is too large, so we do not mind if the parser truncates it at a certain size.
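To make the requested behavior concrete, here is a minimal sketch of "stop at a limit, return what was collected so far, and tell the caller". It uses only the JDK's SAX parser on an XML string. It is not Tika or POI code, and the names LimitedExtraction, LimitedTextHandler, LimitReachedException and Result are hypothetical. It caps only the number of body characters, not memory use during parsing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for illustration only; not part of Tika or POI. */
class LimitedExtraction {

    /** Text gathered so far, plus a flag warning the caller it was cut short. */
    static final class Result {
        final String text;
        final boolean truncated;

        Result(String text, boolean truncated) {
            this.text = text;
            this.truncated = truncated;
        }
    }

    /** Marker exception thrown only to abort parsing once the limit is hit. */
    static final class LimitReachedException extends SAXException {
        LimitReachedException() {
            super("character limit reached");
        }
    }

    /** Collects character data and aborts the parse after maxChars characters. */
    static final class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private final int maxChars;
        private boolean limitReached = false;

        LimitedTextHandler(int maxChars) {
            this.maxChars = maxChars;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int remaining = maxChars - buffer.length();
            if (length > remaining) {
                // Keep only what still fits, then stop the parser.
                buffer.append(ch, start, remaining);
                limitReached = true;
                throw new LimitReachedException();
            }
            buffer.append(ch, start, length);
        }

        String text() {
            return buffer.toString();
        }

        boolean limitReached() {
            return limitReached;
        }
    }

    /** Parses xml and returns at most maxChars characters of text content. */
    static Result extract(String xml, int maxChars) throws Exception {
        LimitedTextHandler handler = new LimitedTextHandler(maxChars);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Rethrow real parse errors; swallow only our own abort.
            if (!handler.limitReached()) {
                throw e;
            }
        }
        return new Result(handler.text(), handler.limitReached());
    }
}
```

A caller would check the flag before using the text, for example `Result r = LimitedExtraction.extract(xml, 100000); if (r.truncated) { /* log a warning */ }`. Returning the flag alongside the partial text, rather than throwing, is what lets the caller keep the content obtained so far.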

People

Assignee: Unassigned

Reporter: Xiaohong Yang

Votes: 0

Watchers: 2

Dates

Created: 08/Aug/21 15:56

Updated: 18/Aug/21 02:53