Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-3750

java.util.zip.DataFormatException when parsing a PDF

VotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 2.0.3
    • 2.0.6, 3.0.0 PDFBox
    • None
    • None
    • java version "1.8.0_121"
      Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
      Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

      org.apache.tika:tika-parsers:1.14

    Description

      I use the following code to parse a PDF:

      PDFParser pdfparser = new PDFParser();
              pdfparser.parse(Test.class.getResourceAsStream("/testdoc.pdf"), handler, metadata, pcontext);
      

      This results in the following exception:

      Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:133)
      	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
      	at com.curecomp.tika.Test.main(Test.java:28)
      Caused by: java.io.IOException: java.util.zip.DataFormatException: too many length or distance symbols
      	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:82)
      	at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69)
      	at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:162)
      	at org.apache.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:189)
      	at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:134)
      	at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:84)
      	at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:164)
      	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
      	at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
      	at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
      	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
      	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
      	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
      	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
      	at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
      	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
      	at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:141)
      	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
      	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
      	... 2 more
      Caused by: java.util.zip.DataFormatException: too many length or distance symbols
      	at java.util.zip.Inflater.inflateBytes(Native Method)
      	at java.util.zip.Inflater.inflate(Inflater.java:259)
      	at java.util.zip.Inflater.inflate(Inflater.java:280)
      	at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:107)
      	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:73)
      	... 21 more
      

      The PDF can be read using Adobe Reader XI 11.0.12.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            tilman Tilman Hausherr
            Mobe Moritz Becker
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment