PDFBox / PDFBOX-2920

IndexOutOfBoundsException when loading large PDF

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.8, 1.8.9, 1.8.10
    • Fix Version/s: None
    • Component/s: Parsing
    • Labels:
    • Environment:
      Software

      Description

      I'm getting exceptions when loading large PDFs (~6-8 GB each). I've tried both PDDocument.load() and PDDocument.loadNonSeq(), and both fail. I can't attach a sample PDF because of the 10 MB attachment size limit. If there is another way to get one to someone, I can work that out. Here is my code:

      public static void main(String[] args) {

          LOGGER.info("Test Large PDF Load " + TEST_PDF);
          try {
              LOGGER.info("Create Steam");
              InputStream is = new FileInputStream(TEST_PDF);
              LOGGER.info("Start Load");
              PDDocument doc = PDDocument.load(is);
              // PDDocument doc = PDDocument.loadNonSeq(is, null);
              LOGGER.info("Finished Load");
              doc.close();
              is.close();
          } catch (IOException e) {
              e.printStackTrace();
          }
      }
      

      The first error occurs with PDDocument.load():

      Aug 06, 2015 1:31:14 PM hp.pdfbox.test.Main main
      INFO: Test Large PDF Load D:\workspace_trunk_luna\test_pdfbox\pdfs\ELOISA ARTOLA CD17433_Indigo.pdf
      Aug 06, 2015 1:31:14 PM hp.pdfbox.test.Main main
      INFO: Create Steam
      Aug 06, 2015 1:32:44 PM hp.pdfbox.test.Main main
      INFO: Start Load
      org.apache.pdfbox.exceptions.WrappedIOException
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:278)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
      at hp.pdfbox.test.Main.main(Main.java:22)
      Caused by: java.lang.IndexOutOfBoundsException: Index: 1041, Size: 1041
      at java.util.ArrayList.rangeCheck(Unknown Source)
      at java.util.ArrayList.get(Unknown Source)
      at org.apache.pdfbox.io.RandomAccessBuffer.seek(RandomAccessBuffer.java:110)
      at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:106)
      at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
      at java.io.BufferedOutputStream.flush(Unknown Source)
      at java.io.FilterOutputStream.close(Unknown Source)
      at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:616)
      at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:650)
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
      ... 3 more
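
The trace above fails inside ArrayList.get, called from RandomAccessBuffer.seek, with an index equal to the list size (Index: 1041, Size: 1041). That pattern is what a lookup one element past the end of a list produces. The following is only a toy model of a buffer kept as a list of fixed-size chunks, written to illustrate that failure mode. It is not PDFBox's code, and the class name, method names, and chunk size are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a buffer stored as a list of fixed-size chunks. Not PDFBox code. */
final class ChunkedBufferSketch {
    // Hypothetical chunk size chosen for the illustration.
    static final int CHUNK_SIZE = 1024;

    private final List<byte[]> chunks = new ArrayList<>();

    ChunkedBufferSketch(int initialChunks) {
        for (int i = 0; i < initialChunks; i++) {
            chunks.add(new byte[CHUNK_SIZE]);
        }
    }

    int chunkCount() {
        return chunks.size();
    }

    /** Looks up the chunk holding position and assumes it was already allocated. */
    byte[] seekUnchecked(long position) {
        int index = (int) (position / CHUNK_SIZE);
        // Throws IndexOutOfBoundsException when index == chunks.size().
        return chunks.get(index);
    }

    /** Allocates any missing chunks before the lookup. */
    byte[] seekGrowing(long position) {
        long index = position / CHUNK_SIZE;
        if (position < 0 || index >= Integer.MAX_VALUE) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        while (chunks.size() <= index) {
            chunks.add(new byte[CHUNK_SIZE]);
        }
        return chunks.get((int) index);
    }

    public static void main(String[] args) {
        ChunkedBufferSketch buffer = new ChunkedBufferSketch(3);
        long firstByteAfterLastChunk = 3L * CHUNK_SIZE;
        try {
            buffer.seekUnchecked(firstByteAfterLastChunk);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked seek failed: " + e.getMessage());
        }
        buffer.seekGrowing(firstByteAfterLastChunk);
        System.out.println("chunks after growing seek: " + buffer.chunkCount());
    }
}
```

In this model, seeking to the first byte after the last allocated chunk fails in the unchecked variant and succeeds in the growing variant. Whether the real RandomAccessBuffer reaches the same state for the same reason is not established by this report.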

      The second error occurs with PDDocument.loadNonSeq():

      INFO: Create Steam
      Aug 06, 2015 1:51:47 PM hp.pdfbox.test.Main main
      INFO: Start Load
      Aug 06, 2015 1:53:39 PM org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
      WARNING: Did not found XRef object at specified startxref position 8552119825
      Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 509, Size: 509
      at java.util.ArrayList.rangeCheck(Unknown Source)
      at java.util.ArrayList.get(Unknown Source)
      at org.apache.pdfbox.io.RandomAccessBuffer.seek(RandomAccessBuffer.java:110)
      at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:106)
      at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
      at java.io.BufferedOutputStream.flush(Unknown Source)
      at java.io.FilterOutputStream.close(Unknown Source)
      at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseCOSStream(NonSequentialPDFParser.java:1847)
      at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1448)
      at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1374)
      at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:1348)
      at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:429)
      at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:915)
      at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1305)
      at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1288)
      at hp.pdfbox.test.Main.main(Main.java:22)
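
The startxref position in the warning above, 8552119825, is larger than Integer.MAX_VALUE (2147483647), as expected for a file of roughly 8 GB. The sketch below uses only the Java standard library to show why offsets into such a file have to be kept in a long. It does not claim that PDFBox narrows this value anywhere, and the class and method names are hypothetical.

```java
/** Shows how a byte offset from an ~8 GB file behaves as a 32-bit int. Not PDFBox code. */
final class OffsetWidthSketch {
    // The startxref position reported in the warning.
    static final long STARTXREF = 8552119825L;

    /** True when the offset can be represented as an int without loss. */
    static boolean fitsInInt(long offset) {
        return offset >= Integer.MIN_VALUE && offset <= Integer.MAX_VALUE;
    }

    /** Plain narrowing cast: silently keeps only the low 32 bits. */
    static int narrowSilently(long offset) {
        return (int) offset;
    }

    /** Checked narrowing: throws ArithmeticException instead of wrapping. */
    static int narrowChecked(long offset) {
        return Math.toIntExact(offset);
    }

    public static void main(String[] args) {
        System.out.println("fits in int: " + fitsInInt(STARTXREF));
        System.out.println("silent cast: " + narrowSilently(STARTXREF));
        try {
            narrowChecked(STARTXREF);
        } catch (ArithmeticException e) {
            System.out.println("checked cast refused: " + e.getMessage());
        }
    }
}
```

The silent cast wraps to a negative number, while Math.toIntExact fails loudly, which makes it the safer choice wherever a file offset really must be converted to an int.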

              People

              • Assignee: Unassigned
              • Reporter: Brad Baker (beta-brad)
              • Votes: 0
              • Watchers: 6
