[TIKA-3890] Identifying an efficient approach for getting page count prior to running an extraction - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.5.0
Fix Version/s: 2.5.0
Component/s: app
Labels: None
Environment:

OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
Docker container with 5.5GB reserved memory, 6GB limit
Tika config w/ 2GB reserved memory, 5GB limit

Description

Tika does a great job with text extraction until we encounter an Office document with an unreasonably large number of pages of extractable text, for example a Word document containing thousands of pages of text. Unfortunately, we don't have an efficient way to determine the page count before calling the /tika or /rmeta endpoints. We either get back an array allocation error, or we have to set byteArrayMaxOverride to a large number so that the call returns the text or the metadata containing the page count. Returning a result other than the array allocation error can take significant time.

For example, this call:
curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" http://localhost:9998/rmeta/ignore

with the configuration:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">175000000</param>
      </params>
    </parser>
  </parsers>
  <server>
    <params>
      <taskTimeoutMillis>120000</taskTimeoutMillis>
      <forkedJvmArgs>
        <arg>-Xms2000m</arg>
        <arg>-Xmx5000m</arg>
      </forkedJvmArgs>
    </params>
  </server>
</properties>

returns: "xmpTPg:NPages":"14625" in ~53 seconds.
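For reference, the page count in that response can be read with a few lines of Python. This is only a minimal sketch: it assumes the /rmeta body is a JSON array whose first object describes the container document, and the helper name page_count_from_rmeta is hypothetical. It does not make the call any faster; it only shows where the value sits in the response.

```python
import json


def page_count_from_rmeta(body):
    """Return the xmpTPg:NPages value from an /rmeta JSON body, or None.

    Assumes the body is a JSON array and that the first object holds the
    metadata of the container document.
    """
    records = json.loads(body)
    if not records:
        return None
    value = records[0].get("xmpTPg:NPages")
    # Metadata values may arrive as a single string or as a list of strings.
    if isinstance(value, list):
        value = value[0] if value else None
    if value is None:
        return None
    return int(value)
```

With the response quoted above, page_count_from_rmeta would return 14625 as an integer.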

Yes, I know this is a huge docx file, and I don't want to process it. If I don't configure byteArrayMaxOverride, I get this exception in just over a second:

Tried to allocate an array of length 172,983,026, but the maximum length for this record type is 100,000,000.

The exception is the preferred result. With that in mind, can you answer these questions?
1. Will other extractable file types that don't use the OfficeParser also throw the same array allocation error for very large text extractions?
2. Is there any way to correlate the array length returned to the number of lines or pages in the associated file to parse?
3. Is there an efficient way to calculate the number of lines or pages of extractable content in a file before sending it for extraction? The /ignore path parameter on /rmeta does not appear to be significantly more efficient than calling the /tika endpoint, or /rmeta without /ignore.

If it's useful, I can share the 8MB docx file containing 14k pages.
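To make question 3 concrete, one possible pre-check for docx files specifically is to read the page count that the authoring application stored in the package, without parsing the document body. The sketch below is written under several assumptions: the file is a Transitional OOXML package, the extended properties part lives at the conventional path docProps/app.xml, and the function name docx_declared_pages is hypothetical. The stored value is whatever the producing application last wrote, so it may be missing or stale, and it may not match the page count Tika reports.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the extended-properties part in Transitional OOXML packages.
# Strict OOXML packages use a different namespace and are not handled here.
EXT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def docx_declared_pages(source):
    """Return the <Pages> value stored in docProps/app.xml, or None.

    `source` is a file path or a binary file-like object. Only the small
    properties part is read; the document body is never decompressed.
    """
    with zipfile.ZipFile(source) as package:
        try:
            data = package.read("docProps/app.xml")
        except KeyError:
            # Assumption: the part sits at its conventional location.
            return None
    root = ET.fromstring(data)
    node = root.find(f"{{{EXT_NS}}}Pages")
    if node is None or not (node.text or "").strip():
        return None
    try:
        return int(node.text.strip())
    except ValueError:
        return None
```

A caller could skip the upload to Tika when the returned number exceeds some threshold, and fall back to the normal request when the function returns None. Because the input is untrusted, a hardened XML parser may be preferable to the standard library one in production.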

People

Assignee: Unassigned

Reporter: Ethan Wilansky

Votes: 0

Watchers: 3

Dates

Created: 19/Oct/22 21:20

Updated: 20/Oct/22 19:10

Resolved: 20/Oct/22 19:10