  1. PDFBox
  2. PDFBOX-3700

OutOfMemoryError converting PDF to TIFF Images

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.4
    • Fix Version/s: 2.0.5, 3.0.0 PDFBox
    • Component/s: Rendering
    • Labels: None

    Description

      I am using PDFBox to convert PDF documents to a series of TIFF images (one per page). The implementation uses PDFRenderer to render each page. This works when I process a single document in a single thread; however, when I process multiple documents (each in its own thread), I get an OutOfMemoryError.

      Analyzing the heap dump, I see that this is caused by the images cached in DefaultResourceCache. Objects are added to the cache in PDResources, which has a method, private boolean isAllowedCache(PDXObject xobject), that determines whether a PDXObject may be cached. I have extended this method to exclude XObjects whose subtype is COSName.IMAGE, and am now able to process multiple documents in parallel.
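
      The caching policy described above can be sketched in isolation. This is a minimal, self-contained model and not PDFBox code: the class ImageSkippingCacheSketch, its nested XObject type, and the string subtype names are hypothetical stand-ins for DefaultResourceCache, PDXObject and COSName. It only shows the idea that image objects bypass the cache while other XObjects are still retained.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified, hypothetical model of a per-document resource cache.
 * These types are stand-ins for illustration only; they are not PDFBox classes.
 */
public class ImageSkippingCacheSketch
{
    /** Stand-in for an XObject: a subtype name plus its decoded payload. */
    public static final class XObject
    {
        final String subtype;   // e.g. "Image" or "Form"
        final byte[] payload;   // decoded data that would occupy heap if cached

        public XObject(String subtype, byte[] payload)
        {
            this.subtype = subtype;
            this.payload = payload;
        }
    }

    private final Map<String, XObject> cache = new HashMap<>();

    /** Mirrors the proposed check: image XObjects are never cached. */
    public static boolean isAllowedCache(XObject xobject)
    {
        return !"Image".equals(xobject.subtype);
    }

    /** Stores the object only if the policy allows it. */
    public void put(String key, XObject xobject)
    {
        if (isAllowedCache(xobject))
        {
            cache.put(key, xobject);
        }
    }

    /** Returns the cached object, or null if it was never stored. */
    public XObject get(String key)
    {
        return cache.get(key);
    }

    public int size()
    {
        return cache.size();
    }

    public static void main(String[] args)
    {
        ImageSkippingCacheSketch cache = new ImageSkippingCacheSketch();
        cache.put("Im0", new XObject("Image", new byte[1024]));
        cache.put("Fm0", new XObject("Form", new byte[16]));
        // Only the form XObject is retained; the image was skipped.
        System.out.println("cached entries: " + cache.size());
        System.out.println("image cached:   " + (cache.get("Im0") != null));
    }
}
```

      The trade-off in this sketch is that a skipped image has to be decoded again each time it is used, in exchange for a smaller retained heap per open document.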

      A proposed fix would be to add images to the set of objects that are not cached. For example, the following could be added to PDResources.isAllowedCache:

      Bar.java
      COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
      if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
      {
          return false;
      }

      A possible patch is enclosed below. I would like to get a fix in for the next release.

      diff --git a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
      index 6e1e464..aa94122 100644
      --- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
      +++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDResources.java
      @@ -31,15 +31,15 @@
       import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
       import org.apache.pdfbox.pdmodel.font.PDFont;
       import org.apache.pdfbox.pdmodel.font.PDFontFactory;
      +import org.apache.pdfbox.pdmodel.graphics.PDXObject;
      +import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
       import org.apache.pdfbox.pdmodel.graphics.color.PDPattern;
       import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
      +import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
       import org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
      -import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
      -import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
       import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
       import org.apache.pdfbox.pdmodel.graphics.shading.PDShading;
      -import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
      -import org.apache.pdfbox.pdmodel.graphics.PDXObject;
      +import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;

       /**
        * A set of resources available at the page/pages/stream level.
      @@ -445,6 +445,12 @@
                           return false;
                       }
                   }
      +
      +            COSBase image = xobject.getCOSObject().getDictionaryObject(COSName.SUBTYPE);
      +            if (image instanceof COSName && ((COSName) image).equals(COSName.IMAGE))
      +            {
      +                return false;
      +            }
               }
               return true;
           }

      Attachments

        1. jira-pdfbox-3700.zip (6 kB, Viraf Bankwalla)

            People

              Assignee: Unassigned
              Reporter: Viraf Bankwalla (virafb)
              Votes: 0
              Watchers: 3