Description
I have a PDF document with some corrupted pages (they throw errors even in PDF readers such as Adobe Acrobat). Only a few of the 370 pages fail.
The problem is that the first corrupted page is the 15th, and processing of the whole document fails from that point on.
A NullPointerException is thrown while reading the fonts of the corrupted page.
I propose adding a config option (default false) to catch any runtime exception during page processing, the same way intermediate IOExceptions are already handled.
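To illustrate the idea, here is a minimal sketch. It is not the patch itself and does not use Tika or PDFBox classes. The class `TolerantPageLoop`, the method `processPages` and the flag `catchRuntimeExceptions` are all hypothetical names chosen for this example. The corrupted page is simulated by an extractor that throws a NullPointerException on page 15.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical sketch: none of these names come from Tika or PDFBox.
public class TolerantPageLoop {

    /** Outcome of a run: the text of each page plus the numbers of the pages that failed. */
    public static final class Result {
        public final List<String> pageTexts = new ArrayList<>();
        public final List<Integer> failedPages = new ArrayList<>();
    }

    /**
     * Processes pages 1..pageCount with the given extractor.
     * With catchRuntimeExceptions == false (the proposed default) the first
     * RuntimeException is rethrown and aborts the whole document.
     * With catchRuntimeExceptions == true the failing page is recorded,
     * an empty string is stored for it, and processing continues.
     */
    public static Result processPages(int pageCount,
                                      IntFunction<String> extractor,
                                      boolean catchRuntimeExceptions) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.pageTexts.add(extractor.apply(page));
            } catch (RuntimeException e) {
                if (!catchRuntimeExceptions) {
                    throw e; // default behaviour: fail the whole document
                }
                result.failedPages.add(page);
                result.pageTexts.add(""); // keep page indexes aligned
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulate a 370-page document whose 15th page is corrupted.
        IntFunction<String> extractor = page -> {
            if (page == 15) {
                throw new NullPointerException("simulated corrupt font on page " + page);
            }
            return "text of page " + page;
        };
        Result r = processPages(370, extractor, true);
        System.out.println("pages processed: " + r.pageTexts.size());
        System.out.println("failed pages: " + r.failedPages);
    }
}
```

Recording the failed page numbers rather than silently dropping them would let the caller report which pages were skipped. Whether the real option should do that is an open design question.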
Unfortunately, due to an NDA I can't share the document for debugging purposes, but here is the stack trace:
java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.font.PDType0Font.readCode(PDType0Font.java:574)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:745)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:635)
at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:56)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:291)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:202)
at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.detectAnglesAndProcessPage(PDF2XHTML.java:307)
at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:293)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1204)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
I will attach a proposed patch soon.
Attachments
Issue Links
- is related to PDFBOX-5500 NullPointerException in PDType0Font.readCode() if cMap is null (Closed)