Details
Bug
Status: Closed
Major
Resolution: Fixed
2.0.4
None
Description
I am using PDFBox to convert PDF documents to a series of TIFF images (one per page). The implementation uses PDFRenderer to render each page. This works when I process a single document in a single thread, but when I process multiple documents in parallel (each in its own thread) I get an OutOfMemoryError.
Analyzing the heap dump shows that this is caused by the images cached in DefaultResourceCache. Objects are added to the cache in PDResources, whose method private boolean isAllowedCache(PDXObject xobject) determines whether a PDXObject can be cached. I have extended this method to exclude objects whose subtype is COSName.IMAGE, and am now able to process multiple documents in parallel.
A proposed fix would be to include images in the set of objects that are not added to the cache. For example, the following could be added to PDResources.isAllowedCache:
COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
{
    return false;
}
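To show the intended effect of such a check in isolation, here is a minimal, self-contained sketch that does not use PDFBox at all. The class CacheFilterSketch and its methods are hypothetical stand-ins: a plain Map models the XObject dictionary, and the strings "Subtype" and "Image" stand in for COSName.SUBTYPE and COSName.IMAGE. It only illustrates the idea of refusing cache admission for images; it is not the actual PDResources code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a resource cache with an admission filter; not PDFBox code.
public class CacheFilterSketch
{
    // Models the cache: object key -> cached object.
    private final Map<String, Object> cache = new HashMap<>();

    // Mirrors the proposed check: a dictionary whose "Subtype" is "Image" is not cacheable.
    static boolean isAllowedCache(Map<String, String> dictionary)
    {
        return !"Image".equals(dictionary.get("Subtype"));
    }

    // Adds the object only if the filter allows it; returns whether it was cached.
    boolean putIfAllowed(String key, Map<String, String> dictionary, Object value)
    {
        if (!isAllowedCache(dictionary))
        {
            return false;
        }
        cache.put(key, value);
        return true;
    }

    // Number of objects currently held by the cache.
    int size()
    {
        return cache.size();
    }

    public static void main(String[] args)
    {
        CacheFilterSketch sketch = new CacheFilterSketch();

        Map<String, String> image = new HashMap<>();
        image.put("Subtype", "Image");
        Map<String, String> form = new HashMap<>();
        form.put("Subtype", "Form");

        // The image is rejected, the form is cached.
        sketch.putIfAllowed("obj1", image, new byte[1024]);
        sketch.putIfAllowed("obj2", form, "form content");
        System.out.println(sketch.size());
    }
}
```

With a filter like this, large decoded images are never retained by the cache, so memory held per open document stays bounded by the smaller non-image resources. The trade-off is that an image used on several pages would be decoded again each time.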
A possible patch is enclosed below. I would like to get a fix in for the next release.
diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
index 6e1e464..aa94122 100644
--- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
+++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
@@ -31,15 +31,15 @@
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDFontFactory;
+import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
+import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
-import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
-import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
-import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
-import org.apache.pdfbox.pdmodel.graphics.PDXObject;
+import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
/**
 * A set of resources available at the page/pages/stream level.
@@ -445,6 +445,12 @@
return false;
}
}
+
+ COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
+ if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
+ {
+ return false;
+ }
return true;
}
Attachments
Issue Links
- is related to PDFBOX-4041 Memory Leak while converting pdf to images (Closed)
- relates to PDFBOX-3484 Implement some caching of PDImageXObject (Closed)