[TIKA-3847] NullPointerException when processing pdf document (Allow proceed on RuntimeException) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.1
Fix Version/s: 2.5.0
Component/s: parser
Labels: None

Description

I have a PDF document with some corrupted pages (they throw errors even in PDF readers such as Adobe Acrobat). However, only a few of the 370 pages fail.

The problem is that the first corrupted page is page 15, and processing of the whole document fails at that point.

A NullPointerException is thrown while getting the fonts of the corrupted page.
What I propose is to allow (via config, default false) catching any RuntimeException, as we already do for intermediate IOExceptions.
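
To illustrate the idea, here is a minimal, self-contained sketch of a per-page loop with an opt-in flag. It does not use Tika or PDFBox classes and is not the attached patch: the class `TolerantPageLoop`, the method `extractAll`, and the flag name `catchRuntimeExceptions` are all hypothetical, and the page extractor is a stand-in for the real per-page processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical illustration only; none of these names exist in Tika or PDFBox.
public class TolerantPageLoop {

    /** Extracted text plus the numbers of the pages that were skipped. */
    static final class Result {
        final List<String> pageTexts = new ArrayList<>();
        final List<Integer> skippedPages = new ArrayList<>();
    }

    /**
     * Processes pages 1..pageCount with the given extractor.
     * With catchRuntimeExceptions == false (the proposed default) the first
     * RuntimeException aborts the whole document, as happens today.
     * With catchRuntimeExceptions == true the failing page is recorded and
     * processing continues with the next page.
     */
    static Result extractAll(int pageCount,
                             IntFunction<String> pageExtractor,
                             boolean catchRuntimeExceptions) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.pageTexts.add(pageExtractor.apply(page));
            } catch (RuntimeException e) {
                if (!catchRuntimeExceptions) {
                    throw e; // default behaviour: fail the whole parse
                }
                result.skippedPages.add(page); // opt-in: remember and move on
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in extractor: page 15 fails the way the corrupted page does.
        IntFunction<String> extractor = page -> {
            if (page == 15) {
                throw new NullPointerException("corrupted font on page " + page);
            }
            return "text of page " + page;
        };

        Result result = extractAll(370, extractor, true);
        System.out.println("extracted pages: " + result.pageTexts.size());
        System.out.println("skipped pages: " + result.skippedPages);
    }
}
```

Recording the skipped pages instead of silently dropping them would let the caller report which pages could not be processed.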

Unfortunately, due to an NDA I can't share the document for debugging purposes, but here is the stack trace:

java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.PDType0Font.readCode(PDType0Font.java:574)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:745)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:635)
    at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:56)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:291)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:202)
    at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.detectAnglesAndProcessPage(PDF2XHTML.java:307)
    at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:293)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1204)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)

I will attach a proposed patch soon.

Attachments

TIKA-3847.patch, 06/Sep/22 10:25, 7 kB, Yurii

Issue Links

is related to: PDFBOX-5500 NullPointerException in PDType0Font.readCode() if cMap is null (Closed)

People

Assignee: Tilman Hausherr

Reporter: Yurii

Votes: 0

Watchers: 2

Dates

Created: 06/Sep/22 10:08

Updated: 20/Sep/22 17:22

Resolved: 20/Sep/22 17:22