TIKA-2802

Out of memory issues when extracting large files (pst)


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.20, 1.19.1
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None
    • Environment:
      Reproduced on Windows 2012 R2 and Ubuntu 18.04.
      Java: jdk1.8.0_151

    Description

      I have an application that extracts text from multiple files on a file share. I've been running into issues with the application running out of memory, even with ~26 GB dedicated to the heap.

      In the heap dumps I found a "fDTDDecl" buffer that creates very large char arrays and never releases that memory. The attached screenshot shows a heap dump with four SAXParsers holding onto a large chunk of memory; the fourth one is expanded to show that it is all held by the "fDTDDecl" field. This dump is from a scaled-down run (not a 26 GB heap).

      It looks like that DTD field should never grow that large, so I'm wondering whether this is actually a bug in Xerces instead. I can easily reproduce the issue by attempting to extract text from large .pst files.
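
      Roughly, the extraction looks like the following simplified sketch (the class name, argument handling, and the unlimited BodyContentHandler here are just for illustration, not the application's actual code); pointing it at a large .pst is enough to reproduce:

      {code:java}
      import java.nio.file.Paths;

      import org.apache.tika.io.TikaInputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.Parser;
      import org.apache.tika.sax.BodyContentHandler;

      public class PstExtractRepro {
          public static void main(String[] args) throws Exception {
              AutoDetectParser parser = new AutoDetectParser();

              // Register the parser in the context so the PST parser recurses into
              // the embedded mail messages instead of skipping them.
              ParseContext context = new ParseContext();
              context.set(Parser.class, parser);

              // -1 disables the default 100,000-character write limit so the full
              // text is extracted.
              BodyContentHandler handler = new BodyContentHandler(-1);
              Metadata metadata = new Metadata();

              // args[0] is the path to a large .pst file from the file share.
              try (TikaInputStream stream = TikaInputStream.get(Paths.get(args[0]))) {
                  parser.parse(stream, handler, metadata, context);
              }

              System.out.println("Extracted " + handler.toString().length() + " characters");
          }
      }
      {code}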

    Attachments

    1. Selection_111.png (356 kB, Caleb Ott)
    2. Selection_117.png (357 kB, Caleb Ott)

    People

    • Assignee: Unassigned
    • Reporter: Caleb Ott (cott@redstonecontentsolutions.com)
    • Votes: 0
    • Watchers: 5
