PDFBox / PDFBOX-3742

Unknown dir object c='>' cInt=62 peek='>' peekInt=62

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.13, 2.0.5
    • Fix Version/s: 1.8.14, 2.0.6, 3.0.0 PDFBox
    • Component/s: Parsing
    • Labels: None
    • Environment: Based on Tika Docker image: logicalspark/docker-tikaserver

    Description

      I originally ran into this when running a 69-page PDF through Tika, and I narrowed the issue down to somewhere between two of its pages. Tika ends up responding with faulty XML, as the attached screenshot shows, and the logs contain a stack trace that includes the PDFBox exception. The same exception is reproduced below with the standalone CLI tool.

      I'm using Tika 1.1.4, although I'm not sure which version of PDFBox it bundles. Here's the base Dockerfile.

      $ java -jar pdfbox-app-2.0.5.jar ExtractText buggy.pdf 
      Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
      WARNING: Using fallback font 'LiberationSans-Bold' for 'Arial-BoldMT'
      Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
      WARNING: Using fallback font 'LiberationSans' for 'ArialMT'
      Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
      WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
      Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
      WARNING: Corrupt object reference at offset 150196
      Exception in thread "main" java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 at offset 150196
      	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:954)
      	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:654)
      	at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
      	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:502)
      	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
      	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
      	at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
      	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
      	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
      	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      	at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
      	at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
      	at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
      

      Seems related to PDFBOX-1327.
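
      Since the trace above fails inside processPage, one way to narrow a failure like this down in a longer document is to extract each page separately and record which ones throw. The following is only a minimal sketch of that loop: FailingPageFinder, PageExtractor and findFailingPages are hypothetical names introduced here for illustration, and the extractor in main is a stand-in that fakes the error, not a real PDFBox call.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FailingPageFinder {

    /** Hypothetical callback: extracts the text of one 1-based page, or throws. */
    @FunctionalInterface
    public interface PageExtractor {
        String extract(int pageNumber) throws IOException;
    }

    /** Returns the 1-based numbers of the pages whose extraction threw an IOException. */
    public static List<Integer> findFailingPages(int pageCount, PageExtractor extractor) {
        List<Integer> failing = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                extractor.extract(page);
            } catch (IOException e) {
                // Keep going so that every bad page is reported, not only the first one.
                System.err.println("Page " + page + " failed: " + e.getMessage());
                failing.add(page);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        // Stand-in extractor that fails on page 2, mimicking the parser error in the report.
        PageExtractor fake = page -> {
            if (page == 2) {
                throw new IOException("Unknown dir object c='>' cInt=62 peek='>' peekInt=62");
            }
            return "text of page " + page;
        };
        System.out.println(findFailingPages(3, fake));
    }
}
```

      With PDFBox 2.0.x on the classpath, the extractor could plausibly be a lambda that calls setStartPage(page) and setEndPage(page) on a PDFTextStripper and returns getText(document), with the document opened via PDDocument.load and the page count taken from getNumberOfPages(). That wiring is an assumption and has not been run against the attached buggy.pdf.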

      Attachments

        1. buggy.pdf
          439 kB
          Igor Santos
        2. screenshot_002.png
          19 kB
          Igor Santos

          People

            Assignee: Tilman Hausherr (tilman)
            Reporter: Igor Santos (igorsantos07)
            Votes: 0
            Watchers: 3
