Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Version/s: 1.8.2, 1.8.3
Environment: Windows 7, Java JDK 1.7.0_45
Description
Hello,
I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory.
With a PDF of 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
I also observe that the memory isn't freed after the getText method is called.
You can see my code below:
double virgule = Math.pow(10, 2);
System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
PDDocument cd = PDDocument.load(file);
System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
String pdfText = "";
try {
    PDFTextStripper stripper = new PDFTextStripper();
    pdfText = stripper.getText(cd);
    System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
    stripper.resetEngine();
    stripper = null;
    System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
}
finally {
    if (cd != null) {
        // close() and this println are inferred from the "PDDocument close" line in the output
        cd.close();
        System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
    }
}
retour = new TextField(fieldName, pdfText, Field.Store.NO);
System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
And the result in my output window:
START - Total memory (Mo): 95.0
PDDocument getNumberOfPages - Nombre de pages: 2676
PDDocument load - Total memory (Mo): 121.0
PDFTextStripper getText - Total memory (Mo): 757.0
PDFTextStripper resetEngine - Total memory (Mo): 757.0
PDDocument close - Total memory (Mo): 757.0
TextField - Total memory (Mo): 757.0
pdfText - Total memory (Mo): 757.0
I also tried calling System.gc(), but the memory usage stays the same.
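One caveat about the figures above: Runtime.totalMemory() reports the heap the JVM has currently reserved, and that figure does not necessarily go back down after a garbage collection, so it can stay at its peak even when the objects themselves have been released. A minimal sketch of a measurement that also reports the used heap (totalMemory() minus freeMemory()) is given here. The class and method names (HeapUsage, usedHeapMb, committedHeapMb, report) are hypothetical and not part of the original report, and the byte arrays only stand in for the text extraction step.

```java
import java.util.ArrayList;
import java.util.List;

public class HeapUsage {
    private static final double BYTES_PER_MB = 1000000.0;

    /** Heap currently reserved by the JVM; may stay high after objects are freed. */
    public static double committedHeapMb() {
        return Runtime.getRuntime().totalMemory() / BYTES_PER_MB;
    }

    /** Heap occupied by objects, whether still reachable or not yet collected. */
    public static double usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / BYTES_PER_MB;
    }

    /** Prints both figures so that reserved and occupied heap can be compared. */
    public static void report(String label) {
        System.out.printf("%s - used: %.1f MB, committed: %.1f MB%n",
                label, usedHeapMb(), committedHeapMb());
    }

    public static void main(String[] args) {
        report("START");

        // Stand-in for the extraction step: allocate roughly 50 MB.
        List<byte[]> blocks = new ArrayList<byte[]>();
        for (int i = 0; i < 50; i++) {
            blocks.add(new byte[1000000]);
        }
        report("after allocation");

        // Drop the only reference, then ask for a collection.
        // System.gc() is only a hint; the JVM may ignore it.
        blocks = null;
        System.gc();
        report("after release and System.gc()");
    }
}
```

If the used figure drops after the release step while the committed figure stays flat, the heap was only reserved, not retained. If the used figure also stays high in the real extraction code, something is still holding references, which is consistent with a leak.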
Issue Links
- breaks: PDFBOX-2792 Text extraction ignores bookmarks (Closed)
- depends upon: PDFBOX-1777 memory leak in org.apache.pdfbox.cos.COSDocument (Closed)