Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
1.8.10, 1.8.11, 2.0.0, 3.0.0 PDFBox
-
None
Description
I'm trying to parse a fairly large PDF (26 MB) and convert it to text, but memory consumption is so high that it ends in an out-of-memory error. Running my test with -Xmx768M always fails; I have to increase the heap to 1500M to make it work.
The resulting text is only 3 MB, so I don't understand why it takes so much memory.
I've tested this code with 1.8.10, 1.8.11 and 2.0.0, with the same result.
The pdf can be found here
My code:
Test.java
@Test
public void testParsePdf_Content_Memory() throws Exception {
    InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
    try {
        StringWriter writer = new StringWriter(); // declared but never used
        FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
        PDFTextStripper pdfTextStripper = new PDFTextStripper();
        // Load the whole document, then stream the extracted text to the file
        pdfTextStripper.writeText(PDDocument.load(inputStream), fileWriter);
        fileWriter.close();
    } finally {
        inputStream.close();
    }
}
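To put a number on the memory consumption described above, one option is to record the JVM's peak heap usage around the extraction call. The following is only a minimal sketch using the standard `java.lang.management` API; the class `HeapProbe` and its method names are hypothetical, and the `Runnable` in `main` is a stand-in allocation where the `PDFTextStripper` call from the test would go.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of PDFBox: reports heap figures for a task.
public class HeapProbe {

    /** Heap ceiling in bytes; this is the value that -Xmx controls. */
    public static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Resets the recorded peak of every heap memory pool. */
    public static void resetPeaks() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                pool.resetPeakUsage();
            }
        }
    }

    /**
     * Sums the peak usage of all heap pools. The pools may peak at
     * different moments, so treat the sum as an upper estimate.
     */
    public static long peakHeapBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage peak = pool.getPeakUsage(); // null if the pool is no longer valid
            if (peak != null) {
                total += peak.getUsed();
            }
        }
        return total;
    }

    /** Runs the task and returns the estimated peak heap usage during it. */
    public static long peakDuring(Runnable task) {
        resetPeaks();
        task.run();
        return peakHeapBytes();
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the text extraction under test.
        long peak = peakDuring(() -> {
            byte[] block = new byte[8 * 1024 * 1024];
            block[0] = 1;
        });
        System.out.printf("max heap: %d MB, estimated peak: %d MB%n",
                maxHeapBytes() / (1024 * 1024), peak / (1024 * 1024));
    }
}
```

Comparing the reported peak against the size of the input PDF and the output text gives a concrete figure to attach to a report like this one. A heap dump taken with `-XX:+HeapDumpOnOutOfMemoryError` would additionally show which objects hold the memory.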
Attachments
Issue Links
- is related to: PDFBOX-5499 Performance issue since 2.0.18 (Closed)
- relates to: TIKA-1907 Big Pdf parsing to text - Out of memory (Open)