PDFBOX-3451

IOException at org.apache.pdfbox.pdfparser.BaseParser.readLong

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1, 2.0.2
    • Fix Version/s: None
    • Component/s: Text extraction

    Description

      Apache Tika 1.14-SNAPSHOT (PDFBox 2.0.2) throws the following exception during text extraction from a valid PDF document:

      org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5b529706
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
      at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.extractText(DocumentsTextExtractor.java:44)
      at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.main(DocumentsTextExtractor.java:134)
      Caused by: java.io.IOException: Error: Expected a long type at offset 9003008, instead got '?3????'
      at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1350)
      at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1278)
      at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:739)
      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:721)
      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:652)
      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:612)
      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:215)
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:840)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:780)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:130)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      ... 6 more
      Caused by: java.lang.NumberFormatException: For input string: "?3????"
      at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
      at java.lang.Long.parseLong(Long.java:589)
      at java.lang.Long.parseLong(Long.java:631)
      at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1345)
      ... 17 more

      Please find the failing document and a log with the stack trace in the attachments.
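
      The nested "Caused by" entries in the trace show a NumberFormatException from Long.parseLong being rethrown as an IOException by BaseParser.readLong. The following is a minimal sketch of that wrapping pattern only. The class and method names (ReadLongSketch, parseLongToken) are hypothetical, and this is not PDFBox's actual source; the message format is copied from the trace above.

      ```java
      import java.io.IOException;

      public class ReadLongSketch {

          /**
           * Hypothetical helper: parses a token as a long and, on failure,
           * rethrows as an IOException that keeps the NumberFormatException
           * as its cause, mirroring the shape of the trace in this report.
           */
          static long parseLongToken(String token, long offset) throws IOException {
              try {
                  return Long.parseLong(token);
              } catch (NumberFormatException e) {
                  throw new IOException("Error: Expected a long type at offset "
                          + offset + ", instead got '" + token + "'", e);
              }
          }

          public static void main(String[] args) {
              try {
                  // A token of non-digit bytes, as reported at offset 9003008.
                  parseLongToken("?3????", 9003008L);
              } catch (IOException e) {
                  System.out.println(e.getMessage());
                  System.out.println("cause: " + e.getCause().getClass().getName());
              }
          }
      }
      ```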

      Attachments

        1. att3x1l.pdf
          10.37 MB
          Yauheni Salopiy
        2. att3x1l.txt
          100 kB
          Maruan Sahyoun
        3. PDFBOX-3451_LOG.txt
          2 kB
          Yauheni Salopiy

        Activity

          tilman Tilman Hausherr added a comment - - edited

          The file is corrupt. It has a large amount of zeros near the mentioned offset 9003008.
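
          One way to check a file for this kind of damage is to count the zero bytes in a window around the reported offset. The following is a minimal sketch using only the Java standard library; the class name ZeroRunCheck, the method countZeroBytes and the window radius are assumptions chosen for illustration and are not part of PDFBox.

          ```java
          import java.io.IOException;
          import java.io.RandomAccessFile;

          public class ZeroRunCheck {

              /**
               * Counts 0x00 bytes in the window [offset - radius, offset + radius),
               * clamped to the bounds of the file.
               */
              static long countZeroBytes(String path, long offset, int radius) throws IOException {
                  try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                      long start = Math.max(0L, offset - radius);
                      long end = Math.min(file.length(), offset + radius);
                      if (end <= start) {
                          return 0L; // window lies entirely outside the file
                      }
                      byte[] window = new byte[(int) (end - start)];
                      file.seek(start);
                      file.readFully(window);
                      long zeros = 0L;
                      for (byte b : window) {
                          if (b == 0) {
                              zeros++;
                          }
                      }
                      return zeros;
                  }
              }

              public static void main(String[] args) throws IOException {
                  if (args.length < 3) {
                      System.err.println("usage: ZeroRunCheck <file> <offset> <radius>");
                      return;
                  }
                  long offset = Long.parseLong(args[1]);
                  int radius = Integer.parseInt(args[2]);
                  long zeros = countZeroBytes(args[0], offset, radius);
                  System.out.println(zeros + " zero bytes within " + radius
                          + " bytes of offset " + offset);
              }
          }
          ```

          For the attachment in this report it could be run with the arguments att3x1l.pdf 9003008 4096. A count close to the window size would be consistent with a zero-filled region, although it does not by itself prove the file is corrupt.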

          Genstr Yauheni Salopiy added a comment - - edited

          Hi tilman,

          Thank You for the investigation.

          Is it possible to make PDFBox more forgiving in such cases?
          I'm asking because I can open this PDF document with Acrobat Reader DC, though I can confirm that the other PDF readers I tried weren't able to open it.

          Thank You in advance!

          Best Regards,
          Yauheni Salopiy


          tilman Tilman Hausherr added a comment -

          See my answer in PDFBOX-3452.
          msahyoun Maruan Sahyoun added a comment -

          Probably fixed a while ago. I can't find a matching ticket, though. It works at least since 2.0.20; earlier versions were not tested.


          Genstr Yauheni Salopiy added a comment -

          Hi tilman, msahyoun,

          Thank You!

          Best Regards,
          Yauheni Salopiy

          People

            Assignee: Unassigned
            Reporter: Yauheni Salopiy
            Votes: 0
            Watchers: 3
