  1. Tika
  2. TIKA-2244

excessive memory usage when parsing a large nested package file

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0, 1.15
    • Component/s: core, parser
    • Labels: None

      Description

      When parsing large nested files (Maven JARs and Git objects are a couple of good examples), a large number of BufferedInputStreams are created, and their buffers take up a large amount of memory. Looking through the relevant code, I saw that many of these allocations come from TikaInputStream.get(InputStream, TemporaryResources), which checks whether the InputStream is a BufferedInputStream or ByteArrayInputStream in order to determine whether or not mark is supported. Unfortunately, it is common practice to wrap InputStreams in CloseShieldInputStreams, which causes this check to fail even if mark is in fact supported, so a new BufferedInputStream is allocated anyway.
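
      The difference between a type-based check and a capability-based check can be sketched as follows. This is a minimal, dependency-free illustration, not Tika's actual code and not necessarily the fix that was committed: ShieldedStream is a hypothetical stand-in for a close-shielding wrapper, and wrapByType and wrapByCapability are illustrative names. The sketch assumes the wrapper delegates markSupported() to the stream it wraps, as java.io.FilterInputStream does.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

public class MarkSupportSketch {

    /** Hypothetical close-shielding wrapper: delegates everything except close(). */
    static final class ShieldedStream extends FilterInputStream {
        ShieldedStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Deliberately leave the wrapped stream open.
        }
    }

    /** Type-based check, as described in the report: unknown wrappers get re-buffered. */
    static InputStream wrapByType(InputStream in) {
        if (in instanceof BufferedInputStream || in instanceof ByteArrayInputStream) {
            return in;
        }
        return new BufferedInputStream(in); // allocates a fresh buffer
    }

    /** Capability-based check: ask the stream itself whether mark/reset works. */
    static InputStream wrapByCapability(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        InputStream shielded = new ShieldedStream(new ByteArrayInputStream(new byte[16]));

        // The wrapper is neither of the two recognized types, so it is wrapped again.
        System.out.println(wrapByType(shielded) == shielded);       // false

        // FilterInputStream forwards markSupported() to the wrapped stream, so it is reused.
        System.out.println(wrapByCapability(shielded) == shielded); // true
    }
}
```

      With deeply nested containers, each level that goes through the type-based path adds another buffer, which is consistent with the memory growth described above.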

            People

            • Assignee: Unassigned
            • Reporter: Joshua Hight (joshbooks)
            • Votes: 0
            • Watchers: 3
