[PDFBOX-4182] Improve memory usage of PDFMergerUtility - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.9
Fix Version/s: None
Component/s: None
Labels: None
Flags: Important

Description

I have been running tests that merge a large number (2618) of small PDF documents, each between 100 kB and 130 kB, into a single large PDF (288,433 kB, roughly 288 MB).

Memory consumption seems to be the main limitation.

ScratchFileBuffer appears to account for most of the memory used.

(See the attached screenshots from MAT, the Eclipse Memory Analyzer.)

(I would attach the hprof heap dump so you can analyze it yourselves, but it is rather large.)

Note that it seems impossible to generate a large PDF with a small memory footprint.

I expected that using MemoryUsageSetting with temp file only would let me generate arbitrarily large PDF files, but it doesn't seem to help.

I've run mergeDocuments with each of the following memory settings:

MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L)

MemoryUsageSetting.setupTempFileOnly()

Refactored version, 4 GB heap:

with temp file only, completes 2618 documents in 1.760 min

versus the original version, 8 GB heap:

with temp file only, completes 2618 documents in 2.0 min

Heaps of 6 GB or less result in an OutOfMemoryError. (I didn't try sizes between 6 GB and 8 GB.)

It looks like the loop in mergeDocuments accumulates PDDocument objects in a list, and they are only closed after the merge has completed.

Refactoring the code to close these as they are used, instead of accumulating them and closing them all at the end, improves memory usage considerably (although, based on the MAT analysis, the problem doesn't seem to be eliminated completely).
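
To illustrate the difference between the two closing strategies, here is a minimal sketch that does not use PDFBox at all. `FakeDocument` is a hypothetical stand-in for a loaded source document, and the counters only track how many such objects are open at once; the sketch says nothing about whether PDFBox can safely close a source document before the destination has been saved.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseAsYouGoSketch {
    public static int open = 0;      // documents currently open
    public static int peakOpen = 0;  // highest number open at the same time

    /** Hypothetical stand-in for a loaded source document (not a PDFBox class). */
    public static class FakeDocument {
        public FakeDocument() {
            open++;
            peakOpen = Math.max(peakOpen, open);
        }

        public void close() {
            open--;
        }
    }

    /** Accumulates every document in a list and closes them all after the loop. */
    public static int mergeClosingAtEnd(int count) {
        open = 0;
        peakOpen = 0;
        List<FakeDocument> toClose = new ArrayList<FakeDocument>();
        try {
            for (int i = 0; i < count; i++) {
                FakeDocument doc = new FakeDocument();
                toClose.add(doc);
                // ... a real merge would append the document's pages here ...
            }
        } finally {
            for (FakeDocument doc : toClose) {
                doc.close();
            }
        }
        return peakOpen;
    }

    /** Closes each document as soon as it has been processed. */
    public static int mergeClosingAsYouGo(int count) {
        open = 0;
        peakOpen = 0;
        for (int i = 0; i < count; i++) {
            FakeDocument doc = new FakeDocument();
            try {
                // ... a real merge would append the document's pages here ...
            } finally {
                doc.close();
            }
        }
        return peakOpen;
    }

    public static void main(String[] args) {
        System.out.println("close at end:    peak open = " + mergeClosingAtEnd(2618));   // 2618
        System.out.println("close as you go: peak open = " + mergeClosingAsYouGo(2618)); // 1
    }
}
```

With the first strategy the number of simultaneously open documents grows with the number of inputs; with the second it stays at one, which is the effect the refactoring aims for.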

Another change I've implemented is to create each InputStream only when the file needs to be read, and to close it alongside its PDDocument.

(Some input streams contain buffers; depending on the buffer size and the stream type, accumulating all the streams can be a memory hog.)

These changes seem to be beneficial: I can process the same number of PDFs with about half the memory.
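
The idea of deferring stream creation can be sketched as follows. This is an illustration only, not the attached Supplier.java or Suppliers.java: the nested `Supplier` interface, `lazyStream` and `readAll` are hypothetical names, and byte arrays stand in for files. It is written without lambdas or try-with-resources to stay within Java 6.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class LazySourceSketch {
    /** Hypothetical minimal supplier; Java 6 has no java.util.function.Supplier. */
    public interface Supplier<T> {
        T get();
    }

    public static int streamsOpened = 0; // how many streams have been created so far

    /** Returns a supplier that creates the stream only when get() is called. */
    public static Supplier<InputStream> lazyStream(final byte[] data) {
        return new Supplier<InputStream>() {
            public InputStream get() {
                streamsOpened++;
                return new ByteArrayInputStream(data);
            }
        };
    }

    /** Opens each source just before reading it and closes it right afterwards. */
    public static long readAll(List<Supplier<InputStream>> sources) {
        long total = 0;
        for (Supplier<InputStream> source : sources) {
            InputStream in = source.get();
            try {
                while (in.read() != -1) {
                    total++;
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            } finally {
                try {
                    in.close();
                } catch (IOException ignored) {
                    // nothing useful can be done if close fails in this sketch
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Supplier<InputStream>> sources = new ArrayList<Supplier<InputStream>>();
        sources.add(lazyStream(new byte[100]));
        sources.add(lazyStream(new byte[130]));
        System.out.println("opened before reading: " + streamsOpened); // 0
        System.out.println("bytes read: " + readAll(sources));         // 230
        System.out.println("opened after reading: " + streamsOpened);  // 2
    }
}
```

Registering suppliers instead of open streams means that only one stream, and therefore only one stream buffer, exists at any moment during the loop.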

I'd appreciate it if you could roll these changes into the main codebase.

(I've respected Java 6 compatibility.)

I've attached the Java files of the new implementation:

Suppliers
Supplier
PDFMergerUtilityUsingSupplier

PDFMergerUtilityUsingSupplier can replace the previous version: there are no signature changes, only internal code changes. (Just rename the class to PDFMergerUtility if you decide to implement the changes.)

The attachments also include screenshots from VisualVM showing the memory usage of the original and refactored versions, as well as some information produced by MAT after analysing the heap.

If you know of any other way to merge large sets of PDF files into a single large PDF without running into memory issues, I'd love to hear about it!

I'd also suggest further improvements to memory usage in general, as PDFBox seems to consume a lot of memory.

Attachments

- oom-2gb-heap-after-refactoring-leak-suspect-1.png (05/Apr/18 14:48, 35 kB, Pas Filip)
- oom-2gb-heap-after-refactoring-leak-suspect-2.png (05/Apr/18 14:48, 159 kB, Pas Filip)
- Suppliers.java (05/Apr/18 14:48, 2 kB, Pas Filip)
- Supplier.java (05/Apr/18 14:48, 0.1 kB, Pas Filip)
- PDFMergerUtilityUsingSupplier.java (05/Apr/18 14:48, 33 kB, Pas Filip)
- failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png (05/Apr/18 15:07, 79 kB, Pas Filip)
- successful - refactored-merge-utility-4gb-heap-2618-files-merged.png (05/Apr/18 15:07, 81 kB, Pas Filip)
- successful -merge-utility-6gb-heap-2618-files-merged.png (05/Apr/18 15:07, 85 kB, Pas Filip)
- successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png (05/Apr/18 15:07, 171 kB, Pas Filip)
- successful-merge-utility-8gb-heap-2618-files-merged.png (05/Apr/18 15:07, 162 kB, Pas Filip)
- successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png (05/Apr/18 15:07, 146 kB, Pas Filip)
- merge-pdf-stats.xlsx (06/Apr/18 16:10, 91 kB, Pas Filip)
- merge-utility.patch (10/Apr/18 20:27, 15 kB, Pas Filip)

Issue Links

is cloned by

PDFBOX-4188 "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

Open
