[TIKA-818] Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow for a memory vs performance tradeoff - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.10, 1.0
Fix Version/s: 1.1
Component/s: parser
Labels: None

Description

After upgrading to Tika 0.10, I began getting OOM errors while processing large numbers of PDFs in parallel. The heap dump indicated that all the memory was being used up by PDFBox RandomAccessBuffer instances. After digging around, it looks like PDFBox now defaults to using RAM rather than temporary files for PDF extraction. This can be overridden to use RandomAccessFile instead.

I propose that Tika choose between file and buffer based on the type of input stream it receives. If the TikaInputStream is backed by a file, RandomAccessFile should be used; for other stream types, RandomAccessBuffer can be used.

I believe the code to control this is here:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

At ~line 87:
PDDocument pdfDocument =
PDDocument.load(new CloseShieldInputStream(stream), true);

I am not sure this is the best approach, and I am curious whether there are other ideas on how to control this while keeping the interface clean.
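
To make the proposal concrete, here is a minimal sketch of the selection logic only. It uses plain JDK classes: ScratchChoice, ScratchKind, and choose are hypothetical names, and the FileInputStream check is an assumption standing in for asking the TikaInputStream whether it is backed by a file. Wiring the result to PDFBox's RandomAccessFile or RandomAccessBuffer is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class ScratchChoice {

    // Where the parser would keep its working data.
    enum ScratchKind { TEMP_FILE, MEMORY_BUFFER }

    // Assumption: a FileInputStream stands in for "the input is already a file".
    // File-backed input suggests large documents, so prefer temp-file scratch
    // storage; anything else falls back to the faster in-memory buffer.
    static ScratchKind choose(InputStream stream) {
        return (stream instanceof FileInputStream)
                ? ScratchKind.TEMP_FILE
                : ScratchKind.MEMORY_BUFFER;
    }

    public static void main(String[] args) {
        InputStream inMemory = new ByteArrayInputStream(new byte[0]);
        // Wrapping stdin's descriptor avoids needing a real file for the demo.
        InputStream fileBacked = new FileInputStream(FileDescriptor.in);
        System.out.println(choose(inMemory));   // MEMORY_BUFFER
        System.out.println(choose(fileBacked)); // TEMP_FILE
    }
}
```

The point of keeping the decision in one small function is that the parser's public interface would not need a new option: callers who hand over a file get the low-memory path, and everyone else keeps the current behavior.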

Attachments

- choose_inmemory_vs_temp_file_pdf.patch (24/Jan/12 06:36, 3 kB, Paul Pearcy)
- choose_inmemory_vs_temp_file_pdf_passes_tests.patch (24/Jan/12 07:31, 3 kB, Paul Pearcy)
- PDFParser.java.patch (06/Feb/12 07:26, 3 kB, Paul Pearcy)

People

Assignee: Unassigned

Reporter: Paul Pearcy

Votes: 0

Watchers: 2

Dates

Created: 19/Dec/11 18:24

Updated: 10/Feb/12 14:21

Resolved: 10/Feb/12 14:21

Time Tracking

Estimated: 24h

Remaining: 24h

Logged: Not Specified