PDFBox / PDFBOX-4580

PDFTextStripper::getText() leads to OutOfMemoryError: Java heap space

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.15
    • Fix Version/s: 2.0.17, 3.0.0 PDFBox
    • Component/s: FontBox
    • Labels: None

    Description

      I just discovered a memory issue (OutOfMemoryError: Java heap space) that occurs only when calling stripper.getText(pdfFile) on a PDF that is missing embedded fonts (like the attached one).

      To reproduce the issue, run this snippet against the attached PDF file:

      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.text.PDFTextStripper;

      import java.io.IOException;
      import java.io.InputStream;

      public class OutOfMemoryExample {

          public static void main(String[] args) throws IOException {

              // ceh.pdf (attached) must be available on the classpath
              try (InputStream docStream = Thread.currentThread().getContextClassLoader().getResource("ceh.pdf").openStream();
                   PDDocument document = PDDocument.load(docStream)) {

                  PDFTextStripper stripper = new PDFTextStripper();

                  // OutOfMemoryError is thrown here
                  String pdfText = stripper.getText(document);

                  System.out.println(pdfText);
              }

          }
      }
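
      When reproducing this, it may help to confirm that the failure really is heap exhaustion and to record how large the heap was. The following is only a sketch using the standard library; `HeapProbe`, `Result`, and `run` are hypothetical names introduced here for illustration and are not part of PDFBox.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of PDFBox): runs a task and reports
// whether it failed with OutOfMemoryError.
final class HeapProbe {

    /** Outcome of one guarded run: the value (null on failure) and whether the heap ran out. */
    public static final class Result<T> {
        public final T value;
        public final boolean outOfMemory;

        Result(T value, boolean outOfMemory) {
            this.value = value;
            this.outOfMemory = outOfMemory;
        }
    }

    /** Logs the JVM heap limit, runs the task, and catches OutOfMemoryError. */
    public static <T> Result<T> run(Callable<T> task) throws Exception {
        long maxMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.err.println("Max heap: " + maxMiB + " MiB");
        try {
            return new Result<>(task.call(), false);
        } catch (OutOfMemoryError e) {
            // Memory held by the failed call is usually reclaimable once the
            // stack unwinds, so reporting here normally still works.
            System.err.println("OutOfMemoryError: " + e.getMessage());
            return new Result<>(null, true);
        }
    }
}
```

      In the reproduction above, the extraction call could be wrapped as `HeapProbe.run(() -> stripper.getText(document))`. Running it with different `-Xmx` settings would then show whether the error depends on the configured heap size.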
      

      Attachments

        1. ceh.pdf (343 kB, Guilhermo)

            People

              Assignee: Tilman Hausherr (tilman)
              Reporter: Readonly Guilhermo
              Votes: 0
              Watchers: 4