[PDFBOX-2101] Surprising memory consumption when extracting images - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.8.5
Fix Version/s: 1.8.6, 2.0.0
Component/s: Utilities
Labels: None
Environment:
Windows 7
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

Description

ExtractImages seems to fail to release memory resources on some files in both PDFBox 1.8.5 and trunk.

On this 4 MB file, http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf, extracting every image on every page (there are many, many duplicate images) causes an OOM with -Xmx1g. If -Xmx is not set and more than 2.5 GB is available, ExtractImages works.

With some experimentation, the triggers seem to be JPEG images that have masks. I'm not sure, though, whether the issue is with PDFBox or Java.
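
Since the file contains many duplicate images, one possible way to reduce the work (not a fix for the underlying memory behavior) is to skip images whose encoded bytes have already been seen. The following is a minimal sketch under that assumption; `DuplicateImageFilter` and `isFirstOccurrence` are hypothetical names for illustration and are not part of PDFBox.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a PDFBox class: remembers a digest of every
// image byte sequence it has been offered so repeats can be skipped.
public class DuplicateImageFilter {
    private final Set<String> seenDigests = new HashSet<String>();

    /** Returns true the first time a byte sequence is offered, false afterwards. */
    public boolean isFirstOccurrence(byte[] encodedImage) {
        return seenDigests.add(sha256Hex(encodedImage));
    }

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DuplicateImageFilter filter = new DuplicateImageFilter();
        byte[][] images = { {1, 2, 3}, {4, 5, 6}, {1, 2, 3} }; // stand-in data
        int written = 0;
        int skipped = 0;
        for (byte[] image : images) {
            if (filter.isFirstOccurrence(image)) {
                written++; // a real tool would write the image out here
            } else {
                skipped++;
            }
        }
        System.out.println("written=" + written + " skipped=" + skipped);
    }
}
```

Only the digests are retained, so the filter's own footprint stays small even when the images are large. Whether the duplicates in this PDF share identical encoded bytes is an assumption that would need checking.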

Command lines:
1.8.5:
java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf

2.0_SNAPSHOT:
java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf

Results:
1.8.5: 906 files before OOM

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
        at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
        at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
        at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
        at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)

2.0_SNAPSHOT: 428 files before OOM

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
        at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
        at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
        at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
        at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
        at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
        at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
        at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
        at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
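
Both traces fail inside ByteArrayOutputStream.grow, which means an entire stream was being buffered on the heap when memory ran out. For comparison, here is a minimal sketch of copying a stream through a fixed-size buffer so that heap use does not grow with the stream length. `ChunkedCopy` is a hypothetical class for illustration, not PDFBox code, and the approach only helps where the consumer can process the data incrementally; decoding into an in-memory raster generally still needs the full pixel array.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration, not PDFBox code.
public class ChunkedCopy {

    /** Copies all bytes from in to out through one reusable buffer; returns the byte count. */
    public static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    /** Sink that discards its input, standing in for a file or socket. */
    static final class DiscardingOutputStream extends OutputStream {
        @Override
        public void write(int b) {
            // intentionally empty
        }

        @Override
        public void write(byte[] b, int off, int len) {
            // intentionally empty
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream(new byte[1000000]); // stand-in data
        long copied = copy(source, new DiscardingOutputStream(), 8192);
        System.out.println("copied " + copied + " bytes");
    }
}
```

With a ByteArrayOutputStream, each growth step briefly holds both the old and the new backing array, so peak heap use can exceed the size of the data itself.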

Attachments

239665.pdf (29/May/14 17:45, 3.90 MB, John Hewson)
java.hprof.zip (29/May/14 15:11, 63 kB, Tim Allison)
PDFBOX-2101-714-poor.jpg (29/May/14 08:08, 34 kB, Tilman Hausherr)
PDFBOX-2101-298-good.jpg (29/May/14 08:08, 12 kB, Tilman Hausherr)

Issue Links

duplicates
PDFBOX-626 Reduce the memory impact of the COS object model (Closed)

is related to
PDFBOX-2310 codeToGID NPE (Closed)
PDFBOX-2323 More flexible image caching (OOM) (Closed)
PDFBOX-1462 Use file backed buffer for FlateFilter? (Closed)
TIKA-1294 Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs (Closed)

relates to
PDFBOX-2313 ExtractImages finds never-rendered images (Closed)
TIKA-1375 Decrease memory consumption when extracting images from PDFs (Closed)

People

Assignee: Andreas Lehmkühler
Reporter: Tim Allison
Votes: 0
Watchers: 6

Dates

Created: 28/May/14 16:07
Updated: 07/Sep/14 08:19
Resolved: 15/Jun/14 11:34