[PDFBOX-3284] Big Pdf parsing to text - Out of memory - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.10, 1.8.11, 2.0.0, 3.0.0 PDFBox
Fix Version/s: 3.0.0 PDFBox
Component/s: Parsing
Labels: None

Description

I'm trying to parse a fairly large PDF (26 MB) and convert it to text, but memory consumption is so high that it ends in an out-of-memory error. Running my test with -Xmx768M always fails; I have to increase it to 1500M to make it work.
The resulting text is only 3 MB, so I don't understand why it takes so much memory.

I've tested this code with 1.8.10, 1.8.11 and 2.0.0, with the same result.

The pdf can be found here

My code:

Test.java

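// Needs java.io.*, org.junit.Test, org.apache.pdfbox.pdmodel.PDDocument and
// PDFTextStripper (org.apache.pdfbox.text in 2.0.x, org.apache.pdfbox.util in 1.8.x).
@Test
public void testParsePdf_Content_Memory() throws Exception {
    InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
    try {
        FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
        // Keep a reference to the document so it can be closed after extraction.
        PDDocument document = PDDocument.load(inputStream);
        try {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            pdfTextStripper.writeText(document, fileWriter);
        } finally {
            document.close();
            fileWriter.close();
        }
    } finally {
        inputStream.close();
    }
}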
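
To put numbers on the heap figures mentioned in the description (failing at -Xmx768M, passing at 1500M), a small probe built only on the JDK's management API can report the configured maximum and the peak heap usage around the extraction. This is a minimal sketch and not part of PDFBox: `HeapProbe` and its method names are hypothetical, the byte array in `main` merely stands in for the `writeText` call, and summing per-pool peaks gives an upper bound on the true peak, not an exact value.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; not a PDFBox class.
public class HeapProbe {

    /** Maximum heap the JVM will attempt to use (reflects -Xmx or the JVM default). */
    public static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Resets the recorded peak of every memory pool to its current usage. */
    public static void resetPeaks() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            pool.resetPeakUsage();
        }
    }

    /**
     * Sums the peak usage of all heap pools since the last reset.
     * Pools may peak at different moments, so this is an upper bound.
     */
    public static long peakHeapBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage peak = pool.getPeakUsage();
            if (peak != null) { // null if the pool is no longer valid
                total += peak.getUsed();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        resetPeaks();
        // Stand-in workload; in the real test this would be the writeText call.
        byte[] workload = new byte[8 * 1024 * 1024];
        workload[0] = 1;
        System.out.printf("max heap:  %d MB%n", maxHeapBytes() >> 20);
        System.out.printf("peak heap: %d MB%n", peakHeapBytes() >> 20);
    }
}
```

Calling `resetPeaks()` right before the extraction and `peakHeapBytes()` right after it would make runs on different PDFBox versions easier to compare than a pass/fail result at a fixed -Xmx.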

Attachments

massparse-stat.txt | 27/Mar/16 16:43 | 331 kB | Tilman Hausherr

Issue Links

is related to: PDFBOX-5499 Performance issue since 2.0.18 (Closed)

relates to: TIKA-1907 Big Pdf parsing to text - Out of memory (Open)

People

Assignee: Unassigned

Reporter: Nicolas Daniels

Votes: 3

Watchers: 8

Dates

Created: 22/Mar/16 16:03

Updated: 18/Aug/23 05:46

Resolved: 14/Jun/19 03:59