PDFBOX-202

Error on text extraction: java.lang.IndexOutOfBoundsException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: Parsing
    • Labels: None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1565617
      Originally submitted by gagravarr on 2006-09-26 03:30.

      I'm trying to extract text from a pdf file
      (http://www.cifor.cgiar.org/mla/download/publication/mozambique.pdf),
      but I'm getting an IndexOutOfBoundsException on it:

      Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 4, Size: 4
      at java.util.ArrayList.RangeCheck(ArrayList.java:546)
      at java.util.ArrayList.get(ArrayList.java:321)
      at org.pdfbox.util.operator.Concatenate.process(Concatenate.java:69)
      at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:494)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:207)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
      at org.pdfbox.ExtractText.main(ExtractText.java:237)

      I've tried with 0.7.2, and 0.7.3-dev-20060920, and I
      get the same exception from both versions.

      Nick
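For readers following the trace: the failing frame is the handler for the PDF `cm` operator, which takes six numeric operands. "Index: 4, Size: 4" is what you would see if the content stream supplied only four operands and the handler read the fifth one unchecked. The sketch below illustrates that failure mode and one possible guard. It is not the PDFBox code or the actual fix; `OperandCheck`, `readMatrixOperands` and `CM_OPERAND_COUNT` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class OperandCheck {
    /** The PDF 'cm' operator takes six operands: a b c d e f. */
    static final int CM_OPERAND_COUNT = 6;

    /**
     * Hypothetical helper: returns the six matrix entries, or null when the
     * operand list is missing or too short (as in a damaged content stream).
     */
    static double[] readMatrixOperands(List<? extends Number> operands) {
        if (operands == null || operands.size() < CM_OPERAND_COUNT) {
            return null; // caller can skip the operator instead of crashing
        }
        double[] matrix = new double[CM_OPERAND_COUNT];
        for (int i = 0; i < CM_OPERAND_COUNT; i++) {
            matrix[i] = operands.get(i).doubleValue();
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Only four operands, mirroring "Index: 4, Size: 4" in the report.
        List<Double> truncated = Arrays.asList(1.0, 0.0, 0.0, 1.0);
        try {
            truncated.get(4); // unchecked access to the fifth operand
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked read failed: " + e.getClass().getSimpleName());
        }
        System.out.println("guarded read skipped: " + (readMatrixOperands(truncated) == null));
    }
}
```

Returning null and letting the caller skip the operator is only one option; logging a warning and continuing with the next operator would serve the same purpose.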

      1. mozambique.pdf (6.41 MB), attached by Jukka Zitting

        Activity

        Jukka Zitting added a comment -

        The exact IndexOutOfBoundsException seems to have been solved along the way, but text extraction with this document still fails with a NullPointerException when using the latest trunk version:

        $ java -jar pdfbox-app-1.4.0-SNAPSHOT.jar ExtractText mozambique.pdf
        07.12.2010 07:29:54 org.apache.pdfbox.pdfparser.BaseParser parseCOSDictionary
        WARNUNG: Bad Dictionary Declaration org.apache.pdfbox.io.PushBackInputStream@27ce2dd4
        07.12.2010 07:29:54 org.apache.pdfbox.pdfparser.BaseParser parseCOSDictionary
        WARNUNG: Invalid dictionary, found: '?' but expected: '/'
        ExtractText failed with the following exception:
        java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:187)
        at org.apache.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:175)
        at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:211)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:237)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)

        I've attached the referenced PDF document for the record.

        Adam Nichols added a comment -

        First, I tested ExtractText.main(new String[] {"C:\\Temp\\PDFBOX-202\\mozambique.pdf"}); and it did not throw any exceptions with the current HEAD (this includes two patches I made today to protect against NPEs). So this is fixed in the current HEAD.

        No text is extracted in the txt file, but since Adobe Acrobat Standard 8, this is expected. It's a corrupt PDF, so there's not much we can do with it, but it's good that it doesn't throw an exception anymore.
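The behavior described above (no exception, but also no extracted text) is what a null-tolerant walk over a damaged page tree would produce. The following is a minimal sketch of that idea, not the patches mentioned in this comment; `PageTreeWalk`, `Node` and `collectPages` are hypothetical stand-ins and do not correspond to `PDPageNode` or any other PDFBox class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PageTreeWalk {
    /** Hypothetical page-tree node: either a page (leaf) or a branch with kids. */
    static final class Node {
        final String name;
        final boolean page;
        final List<Node> kids; // may be null when the kids array failed to parse

        private Node(String name, boolean page, List<Node> kids) {
            this.name = name;
            this.page = page;
            this.kids = kids;
        }

        static Node page(String name) {
            return new Node(name, true, null);
        }

        static Node branch(String name, List<Node> kids) {
            return new Node(name, false, kids);
        }
    }

    /** Collects page names, skipping null nodes and branches without kids. */
    static List<String> collectPages(Node root) {
        List<String> pages = new ArrayList<>();
        collect(root, pages);
        return pages;
    }

    private static void collect(Node node, List<String> pages) {
        if (node == null) {
            return; // unresolved reference in a corrupt file
        }
        if (node.page) {
            pages.add(node.name);
            return;
        }
        if (node.kids == null) {
            return; // branch whose kids could not be read
        }
        for (Node kid : node.kids) {
            collect(kid, pages);
        }
    }

    public static void main(String[] args) {
        // A root whose kids could not be parsed yields no pages and no exception.
        Node corruptRoot = Node.branch("root", null);
        System.out.println(collectPages(corruptRoot));
    }
}
```

With a root whose kids are unreadable, the walk returns an empty list, so a caller would write an empty text file instead of failing.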


          People

          • Assignee: Adam Nichols
          • Reporter: Anonymous
          • Votes: 0
          • Watchers: 0
