PDFBox / PDFBOX-695

COSStream doesn't actually stream tokens, causing OOM in larger PDF text extraction

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: Text extraction
    • Labels:
    • Environment:
      All

      Description

      Text extraction of certain PDFs has been hanging and/or running out of memory (OOM). Profiling revealed that PDFStreamEngine.processSubStream() eventually calls PDFStreamParser.getTokens(), which collects the parsed tokens into an ArrayList. In some cases, this can use over 1 GB of memory.

      The attached patch replaces PDFStreamParser.getTokens() with PDFStreamParser.getTokensIterator(), which streams the tokens and avoids building the ArrayList. The patch only uses the iterator in the call path of org.apache.pdfbox.ExtractText, so the fix may not benefit other usages. Also, the API used by the fix may not be ideal.
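
To illustrate the difference between the two approaches, here is a minimal, self-contained sketch. It is not the attached patch and does not use PDFBox classes: TokenStreamSketch is a hypothetical stand-in, the whitespace-splitting tokenizer is an assumption made purely for illustration, and the method names only mirror getTokens() and getTokensIterator() from the description. The eager variant holds every token in memory at once, while the lazy variant holds a single lookahead token.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a content-stream parser; not PDFBox code. */
public class TokenStreamSketch {

    /** Eager variant: materializes every token before the caller sees any. */
    public static List<String> getTokens(Reader in) {
        List<String> all = new ArrayList<>();
        Iterator<String> it = getTokensIterator(in);
        while (it.hasNext()) {
            all.add(it.next());
        }
        return all; // memory use grows with the size of the stream
    }

    /** Lazy variant: reads one whitespace-delimited token per next() call. */
    public static Iterator<String> getTokensIterator(Reader in) {
        return new Iterator<String>() {
            private String lookahead; // next token, or null if not yet read

            @Override
            public boolean hasNext() {
                if (lookahead == null) {
                    lookahead = readToken();
                }
                return lookahead != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String token = lookahead;
                lookahead = null; // drop the reference so it can be collected
                return token;
            }

            /** Returns the next token, or null at end of input. */
            private String readToken() {
                try {
                    StringBuilder sb = new StringBuilder();
                    int c;
                    while ((c = in.read()) != -1) {
                        if (Character.isWhitespace(c)) {
                            if (sb.length() > 0) {
                                break; // token complete
                            }
                        } else {
                            sb.append((char) c);
                        }
                    }
                    return sb.length() > 0 ? sb.toString() : null;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };
    }

    public static void main(String[] args) {
        Reader in = new StringReader("BT /F1 12 Tf (Hello) Tj ET");
        Iterator<String> tokens = getTokensIterator(in);
        int count = 0;
        while (tokens.hasNext()) {
            tokens.next(); // an operator handler would consume the token here
            count++;
        }
        System.out.println(count + " tokens processed");
    }
}
```

A caller that handles each token as it arrives never needs the full list, which is the property the description relies on. Callers that need random access to tokens would still have to buffer them.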

            People

            • Assignee: Unassigned
            • Reporter: Kyle Maxwell (fizx)
            • Votes: 0
            • Watchers: 2
