Details
New Feature
Status: Closed
Major
Resolution: Won't Do
3.0.0 PDFBox
None
None
Description
I decided to evaluate the limits of PDFBox 3.0.0-alpha3 when merging a large number of PDF files.
I attempted to merge 7500 mails, each stored in a separate PDF file, on Windows. Because of the limit on the maximum length of command-line arguments, I merged the files in subsets. I ended up with 5 large PDF files, each around 500-600 MB. I then tried to merge these 5 files, but the merge eventually failed after running for more than 6 hours (see the error log at the bottom). The machine has 48 GB of RAM. PDFBox memory usage fluctuated between 600 MB and a maximum of 13 GB.
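The batching step described above can be sketched as follows. This is a minimal illustration only: the file names and the batch size of 1500 are assumptions chosen so that 7500 inputs yield 5 batches, and `partition` is a hypothetical helper, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeBatches {
    // Splits the input list into consecutive batches of at most batchSize items.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the sub-list so each batch is independent of the input list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 1; i <= 7500; i++) {
            files.add("mail-" + i + ".pdf"); // hypothetical file names
        }
        // Assumed batch size; each batch would be passed to one merge run.
        List<List<String>> batches = partition(files, 1500);
        System.out.println(batches.size() + " batches"); // prints "5 batches"
    }
}
```

Each batch is small enough to pass as arguments to a single merge invocation; the intermediate outputs then become the inputs of a final merge.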
I am wondering whether PDFBox could support a Concatenation mode in addition to the full Merge mode, with no need to create an index table, etc. Given my total lack of understanding of how PDF works, I suppose it could work as follows:
- Read the first file, process it, and append it to the target PDF file. Delete the PDF data and related metadata for this file, except perhaps the last page number.
- Read the second file and process it in the same fashion as in step 1.
- etc.
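The steps above can be sketched as a simple loop. This shows the intended control flow only, under loud assumptions: `Source` is a hypothetical stand-in for one input file, pages are modelled as plain strings, and nothing here is a PDFBox type or API. Whether a valid PDF can actually be produced this way is not established by this sketch.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

public class ConcatSketch {
    /** Hypothetical stand-in for one input file; not a PDFBox type. */
    interface Source {
        List<String> readPages();
    }

    // Processes the inputs one at a time and returns the last page number.
    static int concatenate(List<Source> inputs, PrintWriter target) {
        int lastPageNumber = 0; // the only state carried from one input to the next
        for (Source input : inputs) {
            List<String> pages = input.readPages(); // only this input is loaded
            for (String page : pages) {
                target.print(page); // append to the target
                target.print('\n');
            }
            lastPageNumber += pages.size();
            // 'pages' is unreachable after this iteration, so it can be reclaimed.
        }
        target.flush();
        return lastPageNumber;
    }

    public static void main(String[] args) {
        Source first = () -> Arrays.asList("a1", "a2");
        Source second = () -> Arrays.asList("b1");
        StringWriter buffer = new StringWriter();
        int total = concatenate(Arrays.asList(first, second), new PrintWriter(buffer));
        System.out.println(total + " pages written"); // prints "3 pages written"
    }
}
```

In this sketch the peak memory depends on the largest single input rather than on the sum of all inputs, which is the property the proposal is after.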
If Concatenation is possible, it would greatly reduce the CPU and memory overhead and shorten the processing time.
I admit that merging such a large number of PDF files is not typical, but the issue is valid.
^CException in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Hashtable.rehash(Hashtable.java:419)
at java.base/java.util.Hashtable.addEntry(Hashtable.java:441)
at java.base/java.util.Hashtable.put(Hashtable.java:493)
at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBodyCompressed(COSWriter.java:481)
at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1260)
at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:402)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1542)
at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1418)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1018)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:963)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:982)
at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:476)
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:355)
at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:339)
at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:76)
at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:37)
at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
at picocli.CommandLine.access$1300(CommandLine.java:145)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
at picocli.CommandLine.execute(CommandLine.java:2078)
at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
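The trace above ends in `OutOfMemoryError: Java heap space`, which is raised when the JVM reaches its configured maximum heap; that limit can be well below the physical RAM of the machine. The following minimal sketch prints the limit the running JVM was started with. `HeapCheck` is a hypothetical class name, and whether a larger heap (set with the `-Xmx` option) would have let this merge finish is not known.

```java
public class HeapCheck {
    // Returns the maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with the same `java` launcher and options used for the merge shows the heap ceiling that applied to that run.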
Respectfully,
Zbigniew