Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version/s: 1.6.0, 1.8.7, 2.0.0
None
None
Environment: Windows Vista, Sun JDK 1.6.0_26
Description
I am getting a StreamCorruptedException when trying to parse a possibly invalid PDF document, even when the -force option is specified.
Stack trace:
java.io.StreamCorruptedException: Error: data is null
    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:256)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)
My suggestion is that PDFStreamEngine.processSubStream() skip bad sub-streams instead of throwing an exception when forceParsing is true.
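To illustrate the behavior I have in mind, here is a minimal, self-contained sketch. It is not PDFBox code: the SubStream interface, the processSubStreams method, and the class name are hypothetical stand-ins, and only the forceParsing flag name comes from the description above. The idea is simply to catch the decoding failure per sub-stream and continue when forcing is enabled, and to rethrow otherwise.

```java
import java.io.IOException;
import java.io.StreamCorruptedException;
import java.util.ArrayList;
import java.util.List;

public class ForceParsingSketch {

    /** Hypothetical stand-in for a sub-stream whose decoding may fail. */
    interface SubStream {
        String decode() throws IOException;
    }

    /**
     * Decodes each sub-stream in order. When forceParsing is true, a
     * sub-stream that fails to decode is logged and skipped; otherwise
     * the first failure is rethrown to the caller.
     */
    static List<String> processSubStreams(List<SubStream> streams, boolean forceParsing)
            throws IOException {
        List<String> decoded = new ArrayList<>();
        for (SubStream stream : streams) {
            try {
                decoded.add(stream.decode());
            } catch (IOException e) {
                if (!forceParsing) {
                    throw e; // strict mode: keep the existing behavior
                }
                // lenient mode: report the problem and move on
                System.err.println("Skipping unreadable sub-stream: " + e.getMessage());
            }
        }
        return decoded;
    }

    public static void main(String[] args) throws IOException {
        List<SubStream> streams = new ArrayList<>();
        streams.add(() -> "page text");
        streams.add(() -> {
            // Simulates the failure shown in the stack trace above.
            throw new StreamCorruptedException("Error: data is null");
        });
        streams.add(() -> "more text");

        // With forcing enabled, the corrupted entry is skipped.
        System.out.println(processSubStreams(streams, true)); // prints [page text, more text]
    }
}
```

With forceParsing set to false, the same input would propagate the StreamCorruptedException, which matches what currently happens regardless of the -force option.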