  PDFBox / PDFBOX-3449

NullPointerException at org.apache.pdfbox.pdmodel.PDPageTree.isPageTreeNode

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.2
    • Fix Version/s: None
    • Component/s: PDModel, Text extraction
    • Labels:

      Description

      A number of valid PDF documents fail during text extraction in Apache Tika 1.14-SNAPSHOT (PDFBox 2.0.2) with the following exception:

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@389adf1d
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
      at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.extractText(DocumentsTextExtractor.java:44)
      at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.main(DocumentsTextExtractor.java:134)
      Caused by: java.lang.NullPointerException
      at org.apache.pdfbox.pdmodel.PDPageTree.isPageTreeNode(PDPageTree.java:307)
      at org.apache.pdfbox.pdmodel.PDPageTree.access$100(PDPageTree.java:38)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:164)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:159)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:153)
      at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:123)
      at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
      at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:112)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:151)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      ... 6 more

      The failing documents and a log with the exception stack traces are attached.
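
The trace above ends in isPageTreeNode, reached through recursive enqueueKids calls. One plausible reading is that a kid entry in the page tree resolved to null and was dereferenced. The following is only a minimal sketch of that failure mode and of a null-tolerant traversal. The names PageTreeSketch, Node, isPageTreeNodeUnsafe and collectPages are hypothetical stand-ins, not PDFBox classes or methods, and the actual PDFBox source may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a page tree walk; NOT the PDFBox implementation.
class PageTreeSketch {

    /** Stand-in for a page tree dictionary (assumption, not a PDFBox type). */
    public static final class Node {
        final String type;      // "Pages" for intermediate nodes, "Page" for leaves
        final List<Node> kids;  // may hold null where a reference cannot be resolved

        public Node(String type, Node... kids) {
            this.type = type;
            this.kids = Arrays.asList(kids);
        }
    }

    /** Dereferences the node without a check: throws NullPointerException for null. */
    public static boolean isPageTreeNodeUnsafe(Node node) {
        return "Pages".equals(node.type) || !node.kids.isEmpty();
    }

    /** Null-tolerant variant: a missing node is simply not a page tree node. */
    public static boolean isPageTreeNode(Node node) {
        return node != null && ("Pages".equals(node.type) || !node.kids.isEmpty());
    }

    /** Collects leaf pages depth-first, skipping null kids instead of failing. */
    public static List<Node> collectPages(Node root) {
        List<Node> pages = new ArrayList<>();
        enqueueKids(root, pages);
        return pages;
    }

    private static void enqueueKids(Node node, List<Node> pages) {
        if (node == null) {
            return; // unresolvable kid: skip it
        }
        if (isPageTreeNode(node)) {
            for (Node kid : node.kids) {
                enqueueKids(kid, pages);
            }
        } else {
            pages.add(node);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("Pages",
                new Node("Page"),
                null, // simulates a kid that resolves to nothing
                new Node("Pages", new Node("Page")));
        System.out.println(collectPages(root).size()); // prints 2
    }
}
```

Whether skipping a null kid is the right behavior for these particular documents is a design question for the maintainers. The sketch only shows that the traversal can stay robust when one is present.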

        Attachments

        1. PDFBOX-3449_LOG.txt (12 kB, Yauheni Salopiy)
        2. R412PI_20120813.pdf (250 kB, Yauheni Salopiy)
        3. R416PI_20121220.pdf (419 kB, Yauheni Salopiy)
        4. R425PI_20120713.pdf (41 kB, Yauheni Salopiy)
        5. r1587cp.pdf (192 kB, Yauheni Salopiy)
        6. R2464CP_20121123.pdf (48 kB, Yauheni Salopiy)
        7. R2521CP_20121112.pdf (229 kB, Yauheni Salopiy)

            People

            • Assignee: Unassigned
            • Reporter: Genstr Yauheni Salopiy
            • Votes: 0
            • Watchers: 2