  1. Tika
  2. TIKA-3847

NullPointerException when processing PDF document (allow proceeding on RuntimeException)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 2.5.0
    • Component/s: parser
    • Labels: None

    Description

      I have a PDF document with some corrupted pages (they throw errors even in PDF readers such as Adobe Acrobat). However, only a few of the 370 pages fail.

      The issue is that the first corrupted page is the 15th, and processing of the whole document fails after that.

      A NullPointerException is thrown when getting the fonts of the corrupted page.
      What I propose is to allow (via config, default false) handling any runtime exception the same way we already handle intermediary IOExceptions.
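
      To illustrate the idea, here is a minimal sketch in plain Java. It is not the attached patch and does not use Tika or PDFBox classes: `TolerantPageLoop`, `PageExtractor`, `extractAll`, and the `catchRuntimeExceptions` flag are all hypothetical names chosen for this example. It only shows the intended behaviour: with the flag off (the proposed default) a runtime exception on one page aborts processing, and with the flag on the exception is recorded and the remaining pages are still processed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Tika or PDFBox.
class TolerantPageLoop {

    /** Stand-in for whatever extracts the text of a single page. */
    interface PageExtractor {
        String extract(int pageNumber);
    }

    /**
     * Extracts pages 1..pageCount in order. When catchRuntimeExceptions is
     * true, a RuntimeException on one page is recorded and the loop moves on
     * to the next page; when false, the exception propagates and aborts.
     */
    static String extractAll(int pageCount, PageExtractor extractor,
                             boolean catchRuntimeExceptions,
                             List<RuntimeException> recorded) {
        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            try {
                text.append(extractor.extract(page));
            } catch (RuntimeException e) {
                if (!catchRuntimeExceptions) {
                    throw e; // default behaviour: fail the whole document
                }
                recorded.add(e); // keep the failure so it can be reported later
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Simulate a 4-page document whose page 2 is corrupted.
        PageExtractor extractor = page -> {
            if (page == 2) {
                throw new NullPointerException("corrupted font on page " + page);
            }
            return "[page " + page + "]";
        };
        List<RuntimeException> recorded = new ArrayList<>();
        String text = extractAll(4, extractor, true, recorded);
        System.out.println(text);
        System.out.println(recorded.size() + " page(s) skipped");
    }
}
```

      Recording the exceptions rather than swallowing them means the caller can still report which pages were skipped, for example in the document metadata.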

      Unfortunately, due to an NDA I can't share the document for debugging purposes, but here is the stack trace:

      java.lang.NullPointerException
          at org.apache.pdfbox.pdmodel.font.PDType0Font.readCode(PDType0Font.java:574)
          at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:745)
          at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:635)
          at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:56)
          at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
          at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
          at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
          at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
          at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155)
          at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363)
          at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:291)
          at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
          at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:202)
          at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.detectAnglesAndProcessPage(PDF2XHTML.java:307)
          at org.apache.tika.parser.pdf.PDF2XHTML$AngleDetectingPDF2XHTML.processPage(PDF2XHTML.java:293)
          at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1204)
          at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238)
          at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
          at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196) 

      I will attach a proposed patch soon.

      Attachments

        1. TIKA-3847.patch
          7 kB
          Yurii


            People

              Assignee: Tilman Hausherr (tilman)
              Reporter: Yurii (yshyman)
              Votes: 0
              Watchers: 2
