  1. PDFBox
  2. PDFBOX-3448

NullPointerException at org.apache.pdfbox.pdmodel.common.COSArrayList.convertFloatCOSArrayToList

Details

    Description

      A number of valid PDF documents fail during text extraction in Apache Tika 1.14-SNAPSHOT (PDFBox 2.0.2) with the following exception:

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@3e14c16d
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
      at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.extractText(DocumentsTextExtractor.java:44)
      at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.main(DocumentsTextExtractor.java:134)
      Caused by: java.lang.NullPointerException
      at org.apache.pdfbox.pdmodel.common.COSArrayList.convertFloatCOSArrayToList(COSArrayList.java:297)
      at org.apache.pdfbox.pdmodel.font.PDFont.getWidths(PDFont.java:462)
      at org.apache.pdfbox.pdmodel.font.PDFont.getWidth(PDFont.java:229)
      at org.apache.pdfbox.pdmodel.font.PDFont.getDisplacement(PDFont.java:212)
      at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:695)
      at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:564)
      at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:55)
      at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
      at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
      at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
      at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
      at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
      at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
      at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:144)
      at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
      at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:112)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:151)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      ... 6 more

      The failing documents and a log containing the exception stack traces are attached.
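
The trace above shows only that a null was dereferenced inside COSArrayList.convertFloatCOSArrayToList while PDFont.getWidths was reading font widths; it does not say which value was null. As a generic illustration of how such a float-list conversion can tolerate both a missing list and a missing entry, here is a minimal sketch. The class NullSafeWidths, the method toFloatList, and the fallback parameter are hypothetical stand-ins built on plain java.util types. This is not PDFBox code and not necessarily how the issue was resolved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in, not PDFBox code: a null-tolerant float-list conversion.
class NullSafeWidths {

    // Converts raw numeric entries to floats. A null list yields an empty result,
    // and a null entry is replaced by the fallback instead of being dereferenced.
    static List<Float> toFloatList(List<? extends Number> raw, float fallback) {
        if (raw == null) {
            return Collections.emptyList();
        }
        List<Float> result = new ArrayList<>(raw.size());
        for (Number entry : raw) {
            result.add(entry == null ? fallback : entry.floatValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample data: the second entry is missing.
        List<Number> widths = Arrays.<Number>asList(500, null, 278.5);
        System.out.println(toFloatList(widths, 0f)); // prints [500.0, 0.0, 278.5]
    }
}
```

Whether a fallback of 0 is appropriate for a missing width is itself an assumption; a real fix would have to follow what the PDF specification says about absent or malformed width entries.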

      Attachments

        1. 101119respmoeotprovidewitlist.pdf
          399 kB
          Yauheni Salopiy
        2. 110111respmemosuppmodiscov.pdf
          773 kB
          Yauheni Salopiy
        3. 110111respmoordcompeldisc.pdf
          896 kB
          Yauheni Salopiy
        4. 110111respmoordcompelexhibad.pdf
          1.53 MB
          Yauheni Salopiy
        5. 110111respmoordcompelexhibeg.pdf
          1.95 MB
          Yauheni Salopiy
        6. 110131respspprevieworddeny.pdf
          4.62 MB
          Yauheni Salopiy
        7. 110208respfinalstip.pdf
          1.72 MB
          Yauheni Salopiy
        8. 130429hospauthalbanydoughccrequestadmiss.pdf
          466 kB
          Yauheni Salopiy
        9. PDFBOX-3448_LOG.txt
          21 kB
          Yauheni Salopiy

          People

            tilman Tilman Hausherr
            Genstr Yauheni Salopiy