Tika / TIKA-2802

Out of memory issues when extracting large files (pst)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.20, 1.19.1
    • Affects Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04. Java: jdk1.8.0_151

    Description

      I have an application that extracts text from many files on a file share. I've been running into issues with the application running out of memory (~26 GB dedicated to the heap).
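
      For reference, a minimal sketch of the kind of extraction loop described above, assuming Tika's AutoDetectParser; the share path, class name, and the disabled write limit are illustrative, not the actual application code:

{code:java}
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ShareExtractor {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // Walk the file share and extract text from each file (path is illustrative).
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("/mnt/share"))) {
            for (Path file : files) {
                // -1 disables BodyContentHandler's default 100,000-character write limit.
                BodyContentHandler handler = new BodyContentHandler(-1);
                try (InputStream stream = Files.newInputStream(file)) {
                    parser.parse(stream, handler, new Metadata(), new ParseContext());
                }
                System.out.println(file + ": extracted " + handler.toString().length() + " chars");
            }
        }
    }
}
{code}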

      In the heap dumps I found an "fDTDDecl" buffer that allocates very large char arrays and never releases that memory. The attached screenshot shows a heap dump with four SAXParsers each holding onto a large chunk of memory; the fourth is expanded to show that it is all held by the "fDTDDecl" field. This dump is from a scaled-down run (not a 26 GB heap).

      It looks like that DTD field should never grow that large; I'm wondering whether this is actually a bug in Xerces. I can easily reproduce the issue by extracting text from large .pst files.
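
      One way to observe the retention is to parse the same large .pst repeatedly with a single parser instance and check how much heap survives a GC between passes. This is only a sketch of a reproduction harness (the class name is hypothetical and the Runtime-based measurement is coarse), not the actual setup used here:

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class PstOomRepro {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Path pst = Paths.get(args[0]); // path to a large .pst file
        for (int pass = 1; pass <= 5; pass++) {
            try (InputStream stream = Files.newInputStream(pst)) {
                // Discard the extracted text; we only care about retained heap.
                parser.parse(stream, new DefaultHandler(), new Metadata(), new ParseContext());
            }
            System.gc();
            long used = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
            System.out.printf("after pass %d: %,d bytes still in use%n", pass, used);
        }
    }
}
{code}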

      Attachments

        1. Selection_111.png (356 kB, uploaded by Caleb Ott)
        2. Selection_117.png (357 kB, uploaded by Caleb Ott)


          People

            Assignee: Unassigned
            Reporter: Caleb Ott (cott@redstonecontentsolutions.com)
            Votes: 0
            Watchers: 5
