PDFBOX-1808: PDFTextStripper.getText - high memory usage

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.8.2, 1.8.3
    • Fix Version/s: 1.8.4, 2.0.0
    • Component/s: Text extraction
    • Labels:
    • Environment:
      Windows 7
      Java jdk 1.7.0_45

      Description

      Hello,

      I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory.
      With a PDF that has 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
      I have also observed that the memory isn't freed after the getText method has been called.

      You can see my code below:
      // Scale factor intended for rounding to two decimal places ("Mo" in the messages means MB).
      double virgule = Math.pow(10, 2);
      System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
      // Load the document and report the heap size after loading.
      PDDocument cd = PDDocument.load(file);
      System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
      System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
      String pdfText = "";
      try {
          // Extract the text of the whole document into a single String.
          PDFTextStripper stripper = new PDFTextStripper();
          pdfText = stripper.getText(cd);
          System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
          // Try to release the stripper's internal state.
          stripper.resetEngine();
          stripper = null;
          System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
      } finally {
          // Always close the document, even if extraction fails.
          if (cd != null) {
              cd.close();
              cd = null;
              System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
          }
      }
      // Index the extracted text (Lucene field).
      retour = new TextField(fieldName, pdfText, Field.Store.NO);
      System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);

      And the result in my output window:
      START - Total memory (Mo): 95.0
      PDDocument getNumberOfPages - Nombre de pages: 2676
      PDDocument load - Total memory (Mo): 121.0
      PDFTextStripper getText - Total memory (Mo): 757.0
      PDFTextStripper resetEngine - Total memory (Mo): 757.0
      PDDocument close - Total memory (Mo): 757.0
      TextField - Total memory (Mo): 757.0
      pdfText - Total memory (Mo): 757.0

      I also tried calling System.gc(), but the memory usage stays the same.
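
      A note on the measurement itself: Runtime.totalMemory() returns the heap the JVM currently has reserved, which often does not shrink after a collection, so a flat value does not by itself show that objects are still being retained. The heap actually occupied is usually estimated as totalMemory() - freeMemory(). In addition, the integer division by 1000000 in the code above truncates before the rounding is applied. The following is only a minimal sketch of an alternative measurement; the class HeapUsage and its methods usedBytes, toMegabytes and report are hypothetical names and not part of PDFBox, and the allocation in main merely stands in for the text extraction.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of PDFBox.
final class HeapUsage {
    private HeapUsage() {}

    /** Estimate of the heap bytes currently occupied by objects. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats a byte count as decimal megabytes with two fraction digits. */
    static String toMegabytes(long bytes) {
        // Floating-point division keeps the fraction that integer division would drop.
        return String.format(Locale.ROOT, "%.2f", bytes / 1_000_000.0);
    }

    /** Prints both the occupied heap and the heap reserved by the JVM. */
    static void report(String label) {
        Runtime rt = Runtime.getRuntime();
        System.out.println(label + " - used (MB): " + toMegabytes(usedBytes())
                + ", reserved (MB): " + toMegabytes(rt.totalMemory()));
    }

    public static void main(String[] args) {
        report("START");

        // Stand-in workload: roughly 32 MB of char arrays (2 bytes per char).
        char[][] blocks = new char[32][];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = new char[500_000];
        }
        report("after allocation");

        blocks = null;
        System.gc(); // only a hint; the JVM is free to ignore it
        report("after gc");
    }
}
```

      Calling HeapUsage.report(...) at the same points as the println statements above would show whether the occupied heap falls after close() and System.gc(), even when the reserved heap stays constant. The exact figures depend on the JVM and its heap settings.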

        Attachments

        1. Screenshot2014-01-21-19-51-24.png
          123 kB
          Andreas Lehmkühler
        2. netbeans_project.jpg
          283 kB
          Guyenot Jeremy
        3. 1808-snapshot.nps
          318 kB
          Guyenot Jeremy
        4. 1808-pdfbox usage.jpg
          589 kB
          Guyenot Jeremy
        5. 1808-java usage.jpg
          203 kB
          Guyenot Jeremy
        6. 1808-java char copyofrange.jpg
          601 kB
          Guyenot Jeremy
        7. 1808-java char copyof.jpg
          515 kB
          Guyenot Jeremy
        8. s50-2.png
          209 kB
          Tilman Hausherr
        9. s50-1.png
          212 kB
          Tilman Hausherr
        10. s5-2.png
          201 kB
          Tilman Hausherr
        11. s5-1.png
          200 kB
          Tilman Hausherr
        12. DOSSIER DE CANDIDATURE_001.pdf
          4.86 MB
          Guyenot Jeremy

              People

              • Assignee: Andreas Lehmkühler (lehmi)
              • Reporter: Guyenot Jeremy (jguyenot)
              • Votes: 0
              • Watchers: 7

                  Time Tracking

                  • Estimated: 72h
                  • Remaining: 72h
                  • Logged: Not Specified