PDFBox / PDFBOX-695

COSStream doesn't actually stream tokens, causing OOM in larger PDF text extraction

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: Text extraction
    • Environment: All

    Description

      Text extraction from certain PDFs has been hanging and/or running out of memory (OOM). Profiling revealed that PDFStreamEngine.processSubStream() eventually calls PDFStreamParser.getTokens(), which assembles an ArrayList of tokens. In some cases, this list can use over 1 GB of memory.

      The attached patch replaces PDFStreamParser.getTokens() with PDFStreamParser.getTokensIterator(), which streams the tokens and avoids building the ArrayList. The patch only uses the iterator in the call path of org.apache.pdfbox.ExtractText, so other usages may not benefit from the fix. Also, the API introduced by the fix may not be ideal.
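
To illustrate the difference between the two approaches, here is a minimal sketch in plain Java. It is not the attached patch and does not use the PDFBox parser: TokenSketch, its whitespace tokenizer, and countTokens() are hypothetical stand-ins, and only the method names getTokens() and getTokensIterator() are borrowed from the description above. The eager variant holds every token in a list, while the lazy variant keeps only one pending token in memory at a time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Toy whitespace tokenizer; a hypothetical stand-in, NOT the PDFBox parser. */
class TokenSketch {

    /** Eager variant: memory use grows with the number of tokens. */
    static List<String> getTokens(Reader in) {
        List<String> tokens = new ArrayList<>();
        Iterator<String> it = getTokensIterator(in);
        while (it.hasNext()) {
            tokens.add(it.next());
        }
        return tokens;
    }

    /** Lazy variant: reads one token ahead and retains nothing else. */
    static Iterator<String> getTokensIterator(final Reader in) {
        return new Iterator<String>() {
            // The single look-ahead token, or null once input is exhausted.
            private String pending = advance();

            private String advance() {
                try {
                    int c = in.read();
                    // Skip leading whitespace.
                    while (c != -1 && Character.isWhitespace(c)) {
                        c = in.read();
                    }
                    // Collect characters up to the next whitespace or end of input.
                    StringBuilder sb = new StringBuilder();
                    while (c != -1 && !Character.isWhitespace(c)) {
                        sb.append((char) c);
                        c = in.read();
                    }
                    return sb.length() == 0 ? null : sb.toString();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public String next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                String current = pending;
                pending = advance();
                return current;
            }
        };
    }

    /** Example consumer: processes tokens one by one without storing them. */
    static int countTokens(Reader in) {
        int count = 0;
        Iterator<String> it = getTokensIterator(in);
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(new StringReader("BT /F1 12 Tf ET")));
    }
}
```

A caller that handles each token as it arrives, as countTokens() does here, never needs the full list, which is the property the patch relies on for the text extraction path. Callers that need random access to tokens would still have to collect them.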

          People

            Assignee: Unassigned
            Reporter: Kyle Maxwell (fizx)