  1. PDFBox
  2. PDFBOX-5602

Consider adding support for PDF file concatenation in addition to the full merge

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Do
    • Affects Version/s: 3.0.0 PDFBox
    • Fix Version/s: None
    • Component/s: Utilities
    • Labels: None

    Description

      I decided to evaluate the limits of PDFBox 3.0.0-alpha3 when merging a large number of PDF files.

      I attempted to merge 7500 emails, each stored as a separate PDF file, on Windows. Because of the limit on the length of the command-line arguments, I merged subsets of the files first. I ended up with 5 large PDF files, each around 500-600 MB. I then tried to merge these 5 files, but the merge eventually failed after running for more than 6 hours. See the error log at the bottom. The machine has 48 GB of RAM. PDFBox used at most 13 GB of memory; usage fluctuated between 600 MB and 13 GB.
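
      The batching step described above can be sketched with the Java standard library alone. This is only an illustration of splitting a long list of file names into subsets that fit a command-line length budget; the class name `MergeBatcher`, the method `batch`, and the `maxChars` budget are hypothetical and not part of PDFBox, and the right budget depends on the shell in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class MergeBatcher {

    /**
     * Splits file names into consecutive batches whose joined length
     * (each name plus one separating space) stays within maxChars.
     * A single name longer than maxChars still gets a batch of its own.
     */
    static List<List<String>> batch(List<String> fileNames, int maxChars) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int used = 0;
        for (String name : fileNames) {
            int cost = name.length() + 1; // +1 for the separating space
            // Start a new batch when this name would exceed the budget.
            if (!current.isEmpty() && used + cost > maxChars) {
                batches.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(name);
            used += cost;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

      Each resulting batch would be passed to one merge invocation, and the intermediate outputs merged again in a final pass, which is the two-stage procedure described in this report.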

      I am wondering whether PDFBox could support a concatenation mode in addition to the full merge mode, with no need to create an index table, etc. Given my total lack of understanding of how PDF works, I suppose it could work as follows:

      1. Read the first file, process it, and append it to the target PDF file. Delete the PDF data and related metadata for this file, except perhaps the last page number.
      2. Read the second file and process it in the same way as in step 1.
      3. Etc.
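
      The steps above can be sketched as a loop that handles one source at a time and keeps only a running page number between iterations. This is a standard-library illustration of the proposed control flow, not PDFBox code: `ConcatSketch` and the `PageAppender` interface are hypothetical, and the interface merely stands in for whatever component would read one PDF and write its pages to the target. Whether the PDF format allows such an appender to exist is the open question of this request.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of the proposed concatenation loop; not PDFBox API.
final class ConcatSketch {

    /**
     * Hypothetical per-file step: appends all pages of one source to the
     * target, numbering them from firstPageNumber, and returns how many
     * pages were written.
     */
    interface PageAppender {
        int append(Path source, OutputStream target, int firstPageNumber) throws IOException;
    }

    /**
     * Processes the sources strictly one after another. The only state
     * carried from one file to the next is the last page number.
     */
    static int concatenate(List<Path> sources, OutputStream target, PageAppender appender)
            throws IOException {
        int lastPage = 0;
        for (Path source : sources) {
            int written = appender.append(source, target, lastPage + 1);
            lastPage += written;
        }
        return lastPage; // total number of pages in the target
    }
}
```

      With this shape, peak memory would be bounded by the largest single source rather than by the sum of all sources, which is the saving the proposal is after.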

      If concatenation is possible, it would greatly reduce the CPU and memory overhead and the processing time.

      I admit that merging such a large number of PDF files is not typical, but the issue is valid.

      ^CException in thread "main" java.lang.OutOfMemoryError: Java heap space
          at java.base/java.util.Hashtable.rehash(Hashtable.java:419)
          at java.base/java.util.Hashtable.addEntry(Hashtable.java:441)
          at java.base/java.util.Hashtable.put(Hashtable.java:493)
          at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBodyCompressed(COSWriter.java:481)
          at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1260)
          at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:402)
          at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1542)
          at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1418)
          at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1018)
          at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:963)
          at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:982)
          at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:476)
          at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:355)
          at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:339)
          at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:76)
          at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:37)
          at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
          at picocli.CommandLine.access$1300(CommandLine.java:145)
          at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
          at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
          at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
          at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
          at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
          at picocli.CommandLine.execute(CommandLine.java:2078)
          at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)

      Respectfully,

      Zbigniew

       

       

      Attachments

        1. CapturePdfDebugger.PNG
          32 kB
          Zbigniew Minciel
        2. cpu-20-middle.PNG
          162 kB
          Zbigniew Minciel
        3. cpu-hot-spots-3.0.0-alpha3.PNG
          162 kB
          Zbigniew Minciel
        4. cpu-hot-spots-3.0.0-alpha3-1.PNG
          154 kB
          Zbigniew Minciel
        5. cpu-hot-spots-3.0.0-SNAPSHOT.PNG
          151 kB
          Zbigniew Minciel
        6. cpu-hot-spots-3.0.0-SNAPSHOT-1.PNG
          157 kB
          Zbigniew Minciel
        7. Large527MbytesPDF.PNG
          33 kB
          Zbigniew Minciel

          People

            Assignee: Unassigned
            Reporter: Zbigniew Minciel (ziggym)
            Votes: 0
            Watchers: 2
