PDFBox / PDFBOX-3449

NullPointerException at org.apache.pdfbox.pdmodel.PDPageTree.isPageTreeNode

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.2
    • Fix Version/s: None
    • Component/s: PDModel, Text extraction

    Description

      A number of valid PDF documents fail during text extraction in Apache Tika 1.14-SNAPSHOT (PDFBox 2.0.2) with the following exception:

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@389adf1d
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
      at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.extractText(DocumentsTextExtractor.java:44)
      at com.wolterskluwer.atlas.transformer.processFileResources.DocumentsTextExtractor.main(DocumentsTextExtractor.java:134)
      Caused by: java.lang.NullPointerException
      at org.apache.pdfbox.pdmodel.PDPageTree.isPageTreeNode(PDPageTree.java:307)
      at org.apache.pdfbox.pdmodel.PDPageTree.access$100(PDPageTree.java:38)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:164)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:159)
      at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.<init>(PDPageTree.java:153)
      at org.apache.pdfbox.pdmodel.PDPageTree.iterator(PDPageTree.java:123)
      at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:314)
      at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:112)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:151)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      ... 6 more

      Please find the failing documents and a log with the exception stack traces in the attachments.
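
The trace shows the NullPointerException being thrown in isPageTreeNode, reached through nested enqueueKids calls while the page iterator is built. One plausible reading, which is an assumption and not something confirmed in this report, is that an entry in a Kids array resolves to null. The sketch below models that situation with plain Java maps. Every name in it (PageTreeSketch, page, pages, countPages) is hypothetical; it is not PDFBox code and not the fix applied upstream. It only illustrates how a null-tolerant traversal could skip such an entry.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of a PDF page tree; maps stand in for PDF dictionaries.
final class PageTreeSketch {

    // Builds a leaf node (a single page).
    static Map<String, Object> page() {
        Map<String, Object> node = new HashMap<>();
        node.put("Type", "Page");
        return node;
    }

    // Builds an intermediate node whose kids may include null entries.
    static Map<String, Object> pages(List<Map<String, Object>> kids) {
        Map<String, Object> node = new HashMap<>();
        node.put("Type", "Pages");
        node.put("Kids", kids);
        return node;
    }

    // Null-safe check: a missing node is simply not a page tree node.
    static boolean isPageTreeNode(Map<String, Object> node) {
        return node != null && "Pages".equals(node.get("Type"));
    }

    // Recursively collects leaf pages, skipping kids that resolved to null.
    @SuppressWarnings("unchecked")
    static void enqueueKids(Map<String, Object> node, Queue<Map<String, Object>> queue) {
        if (node == null) {
            return; // assumed failure case: a kid that could not be resolved
        }
        if (isPageTreeNode(node)) {
            List<Map<String, Object>> kids = (List<Map<String, Object>>) node.get("Kids");
            if (kids != null) {
                for (Map<String, Object> kid : kids) {
                    enqueueKids(kid, queue);
                }
            }
        } else {
            queue.add(node);
        }
    }

    // Counts the leaf pages reachable from the given root.
    static int countPages(Map<String, Object> root) {
        Queue<Map<String, Object>> queue = new ArrayDeque<>();
        enqueueKids(root, queue);
        return queue.size();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> innerKids = new ArrayList<>();
        innerKids.add(page());
        innerKids.add(null); // simulated unresolvable kid

        List<Map<String, Object>> rootKids = new ArrayList<>();
        rootKids.add(pages(innerKids));
        rootKids.add(page());

        System.out.println(countPages(pages(rootKids))); // prints 2
    }
}
```

Without the null checks, the same input would fail on the first null kid, much like the trace above. Whether silently skipping such a kid is the right behavior for a real parser is a separate design question.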

      Attachments

        1. R416PI_20121220.txt
          29 kB
          Maruan Sahyoun
        2. R2521CP_20121112.txt
          13 kB
          Maruan Sahyoun
        3. R2464CP_20121123.txt
          4 kB
          Maruan Sahyoun
        4. R425PI_20120713.txt
          3 kB
          Maruan Sahyoun
        5. R412PI_20120813.txt
          11 kB
          Maruan Sahyoun
        6. r1587cp.txt
          28 kB
          Maruan Sahyoun
        7. R2521CP_20121112.pdf
          229 kB
          Yauheni Salopiy
        8. R2464CP_20121123.pdf
          48 kB
          Yauheni Salopiy
        9. r1587cp.pdf
          192 kB
          Yauheni Salopiy
        10. R425PI_20120713.pdf
          41 kB
          Yauheni Salopiy
        11. R416PI_20121220.pdf
          419 kB
          Yauheni Salopiy
        12. R412PI_20120813.pdf
          250 kB
          Yauheni Salopiy
        13. PDFBOX-3449_LOG.txt
          12 kB
          Yauheni Salopiy

      People

        Assignee: Unassigned
        Reporter: Yauheni Salopiy
        Votes: 0
        Watchers: 3
