PDFBox / PDFBOX-5606

PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 with the same code

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.28
    • Fix Version/s: 2.0.29, 3.0.0 PDFBox
    • None

    Description

      Given the following simplified Groovy code (Groovy rather than Java, for succinctness):

       

      // Groovy 4.0.12
      import org.apache.pdfbox.pdmodel.PDDocument
      import org.apache.pdfbox.pdmodel.PDPage
      import org.apache.pdfbox.text.PDFTextStripperByArea
      import java.awt.geom.Rectangle2D
      
      int GRID_WIDTH = 10
      int GRID_HEIGHT = 10
      
      PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
          doc.pages.eachWithIndex { PDPage page, int pageIndex ->
              // Number of grid cells needed to cover the page, rounded up
              int rows = Math.ceil((page.mediaBox.height as int) / GRID_HEIGHT)
              int columns = Math.ceil((page.mediaBox.width as int) / GRID_WIDTH)
              println "processing page $pageIndex, rows = $rows, columns = $columns"
              // Map each cell index to its rectangle on the page
              def rectangles = [:]
              (0..<rows).each { rowIndex ->
                  (0..<columns).each { colIndex ->
                      rectangles["${rowIndex * columns + colIndex}"] = new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
                  }
              }
              // A new stripper is created for every cell, and each one extracts from the whole page
              rectangles.each { key, rect ->
                  PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
                  textStripper.addRegion(key, rect)
                  textStripper.extractRegions(page)
              }
          }
      }
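
To give a rough sense of how much work the loop above does per page, the grid arithmetic can be sketched in plain Java without PDFBox. This is only an illustration: the class and method names are hypothetical, and the 612 x 792 pt (US Letter) page size is an assumption, since the media box of test.pdf is not stated in this report.

```java
public class GridRegionCount {

    // Cells needed to cover `extent` points with cells of `cellSize` points.
    // Mirrors Math.ceil((extent as int) / cellSize) from the Groovy script,
    // where the division is not truncated before rounding up.
    public static int cellsAlong(float extent, int cellSize) {
        return (int) Math.ceil((int) extent / (double) cellSize);
    }

    // Total regions per page, i.e. the number of PDFTextStripperByArea
    // instances (and extractRegions calls) the script makes for one page.
    public static int regionsPerPage(float width, float height, int cellWidth, int cellHeight) {
        return cellsAlong(width, cellWidth) * cellsAlong(height, cellHeight);
    }

    public static void main(String[] args) {
        // Assumption: US Letter media box; the real page size may differ.
        int perPage = regionsPerPage(612f, 792f, 10, 10);
        System.out.println("regions per page = " + perPage);
        System.out.println("regions for 10 pages = " + perPage * 10);
    }
}
```

Under that page-size assumption this gives 62 columns by 80 rows, so 4960 stripper instances and extraction passes per page, and 49600 for a 10-page document.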

       

       

      With this code, PDFBox version 2.0.28 uses ever-increasing memory, but version 2.0.27 does not.

      The test.pdf file I am using is an `8-K` filing downloaded from Apple's SEC filings page at https://investor.apple.com/sec-filings/default.aspx, but any PDF of 10 or more pages with a lot of text will work.

      I have attached profiler screenshots of the difference. 

      Thanks in advance for your help. 

      Attachments

        1. pdfbox-2.0.28.png (129 kB, Joe Li)
        2. pdfbox-2.0.27.png (168 kB, Joe Li)
        3. 590031dc-2131-4a00-a936-d1175b7b926c.pdf (571 kB, Tilman Hausherr)
        4. screenshot-1.png (30 kB, Tilman Hausherr)
        5. screenshot-2.png (141 kB, Tilman Hausherr)
        6. 819127-p1.pdf (18 kB, Tilman Hausherr)

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Joe Li (astrogg)
            Votes: 0
            Watchers: 4
