Details
Bug
Status: Closed
Major
Resolution: Fixed
2.0.28
None
Description
Given the following simplified Groovy code (Groovy chosen over Java for succinctness):
// Groovy 4.0.12
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.text.PDFTextStripperByArea

import java.awt.geom.Rectangle2D

int GRID_WIDTH = 10
int GRID_HEIGHT = 10

PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
    doc.pages.eachWithIndex { PDPage page, int pageIndex ->
        int rows = Math.ceil((page.mediaBox.height as int) / GRID_HEIGHT)
        int columns = Math.ceil((page.mediaBox.width as int) / GRID_WIDTH)
        println "processing page $pageIndex, rows = $rows, columns = $columns"

        def rectangles = [:]
        (0..<rows).each { rowIndex ->
            (0..<columns).each { colIndex ->
                rectangles["${rowIndex * columns + colIndex}"] =
                    new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
            }
        }

        rectangles.each { key, rect ->
            PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
            textStripper.addRegion(key, rect)
            textStripper.extractRegions(page)
        }
    }
}
With PDFBox version 2.0.28, memory usage keeps increasing while this code runs; with version 2.0.27 it does not.
The test.pdf file I am using is an `8-K` filing downloaded from the Apple SEC filings page, https://investor.apple.com/sec-filings/default.aspx, but any PDF with 10+ pages and a lot of text will work.
I have attached profiler screenshots of the difference.
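For anyone without a profiler at hand, a rough way to see the same difference is to sample the used heap after each page and compare the first and last readings under both versions. The following is only a minimal sketch in plain Java (callable from the Groovy script above). `HeapSampler` and its methods are hypothetical names, not part of PDFBox, and because `System.gc()` is only a hint to the JVM, the readings are approximate.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of PDFBox): collects used-heap readings
 * so that growth over a run can be compared between library versions.
 */
public class HeapSampler {
    private final List<Long> samples = new ArrayList<>();

    /** Approximate used heap in bytes; System.gc() is only a request. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Takes a reading of the current used heap and stores it. */
    public void sample() {
        record(usedHeapBytes());
    }

    /** Stores an explicit reading (useful for readings taken elsewhere). */
    public void record(long bytes) {
        samples.add(bytes);
    }

    /** Last reading minus first reading; 0 if fewer than two readings. */
    public long growthBytes() {
        if (samples.size() < 2) {
            return 0L;
        }
        return samples.get(samples.size() - 1) - samples.get(0);
    }

    public static void main(String[] args) {
        HeapSampler sampler = new HeapSampler();
        // Stand-in for per-page work: retain 1 MiB per "page" so the heap grows.
        List<byte[]> retained = new ArrayList<>();
        for (int page = 0; page < 5; page++) {
            retained.add(new byte[1024 * 1024]);
            sampler.sample();
        }
        System.out.println("approximate growth in bytes: " + sampler.growthBytes());
    }
}
```

In the script above, one `sampler.sample()` call at the end of each `eachWithIndex` iteration would be enough. A steadily rising `growthBytes()` under 2.0.28, compared with a roughly flat value under 2.0.27, would match what the screenshots show.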
Thanks in advance for your help.