PDFBox / PDFBOX-5479

PDFTextStripper needs 1 GB heap for a 3.6 MB PDF

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.26
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: JDK 11.0.2 on macOS 12.4

    Description

      Extracting text from the attached x.pdf:

      PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"));
      PDFTextStripper stripper = new PDFTextStripper();
      stripper.getText(pdDocument);

      This succeeds with -Xmx1G but throws an OutOfMemoryError with -Xmx900m.

      The heap dump shows 2923 instances of TrueTypeFont. PDResources.cache contains SoftReferences to many fonts, keyed by different COSObjects.
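
The observation above can be illustrated with a minimal, library-free sketch. It is not PDFBox's implementation: `SoftCacheSketch`, `Key`, and `countEntriesAfterLoading` are hypothetical names, and `Key` only stands in for an indirect object whose equality is identity-based. Under that assumption, a cache keyed by such objects never gets a hit when each lookup uses a different key, so every lookup parses and stores a fresh entry.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCacheSketch {
    /** Hypothetical stand-in for an indirect object; equals/hashCode are identity-based. */
    static final class Key {
        final int objectNumber;
        Key(int objectNumber) { this.objectNumber = objectNumber; }
    }

    // Values are softly reachable: the GC may clear them under memory pressure,
    // but until then each entry keeps its payload alive.
    private final Map<Key, SoftReference<byte[]>> cache = new HashMap<>();

    void put(Key key, byte[] parsedFont) {
        cache.put(key, new SoftReference<>(parsedFont));
    }

    byte[] get(Key key) {
        SoftReference<byte[]> ref = cache.get(key);
        return ref == null ? null : ref.get();
    }

    int size() {
        return cache.size();
    }

    /** Simulates one font lookup per distinct key and returns the number of cache entries. */
    static int countEntriesAfterLoading(int lookups) {
        SoftCacheSketch cache = new SoftCacheSketch();
        for (int i = 0; i < lookups; i++) {
            Key key = new Key(i);               // a new key object on every lookup
            if (cache.get(key) == null) {       // always a miss: no earlier key is identical
                cache.put(key, new byte[1024]); // placeholder for a parsed font
            }
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(countEntriesAfterLoading(2923)); // prints 2923
    }
}
```

In this sketch the cache only helps when the same key object is looked up again. If a document reaches equivalent font data through many distinct objects, the entries accumulate until the garbage collector clears the soft references.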

      Attachments

        1. heapDump.png
          259 kB
          Manfred Schauer
        2. x.pdf
          3.43 MB
          Manfred Schauer

          People

            Assignee: Unassigned
            Reporter: Manfred Schauer (maschau)
            Votes: 0
            Watchers: 4
