TIKA-2244: Excessive memory usage when parsing a large nested package file

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0, 1.15
    • Component/s: core, parser
    • Labels: None

      Description

      When parsing large nested files (good examples are Maven JARs and Git objects), a large number of BufferedInputStreams are created, and their buffers take up a large amount of memory. Looking through the relevant code, I saw that many of these allocations come from TikaInputStream.get(InputStream, TemporaryResources), which checks whether the InputStream is a BufferedInputStream or a ByteArrayInputStream in order to determine whether or not mark is supported. Unfortunately, it is common practice to wrap InputStreams in CloseShieldInputStreams, which causes the check to fail even if mark is in fact supported.
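
The difference between the two checks can be illustrated with a minimal sketch. It is not the actual TikaInputStream code: the class name MarkSupportSketch, the helper methods wrapByType and wrapByCapability, and the ShieldedStream wrapper (a stand-in for a close-shielding wrapper, built on java.io.FilterInputStream) are all hypothetical names introduced for illustration. The sketch relies only on the documented JDK behavior that FilterInputStream forwards markSupported() to the stream it wraps.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

final class MarkSupportSketch {

    /** Hypothetical stand-in for a close-shielding wrapper: close() is a no-op. */
    static final class ShieldedStream extends FilterInputStream {
        ShieldedStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Deliberately leave the underlying stream open.
        }
    }

    /** Type-based check: any wrapper class hides the fact that mark is supported. */
    static InputStream wrapByType(InputStream in) {
        if (in instanceof BufferedInputStream || in instanceof ByteArrayInputStream) {
            return in;
        }
        return new BufferedInputStream(in); // allocates a new buffer
    }

    /** Capability-based check: ask the stream itself whether mark is supported. */
    static InputStream wrapByCapability(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        InputStream shielded =
                new ShieldedStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));

        // The type check does not recognize the wrapper and buffers again.
        System.out.println(wrapByType(shielded) == shielded);       // false
        // The capability check reuses the stream as-is.
        System.out.println(wrapByCapability(shielded) == shielded); // true
    }
}
```

In a deeply nested package, each level that takes the type-based path allocates another buffer, which is consistent with the memory growth described above.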

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Good enough. Thank you for diagnosing that inefficiency. Let us know what else you find!

        joshbooks Joshua Hight added a comment -

        Sweet! Thanks for making open source awesome. I can't give you all the details, but I was using YourKit memory snapshots, and I definitely noticed a difference after the change. If you're having difficulty reproducing, I found that parsing Git FETCH_HEAD files for large repositories tends to trigger the behavior.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1187 (See https://builds.apache.org/job/Tika-trunk/1187/)
        TIKA-2244 – be more parsimonious with BufferedInputStream – (tallison: rev 836e2d9040bacd7a8ffe88095187aa28ff26c6ba)

        • (edit) tika-core/src/main/java/org/apache/tika/detect/AutoDetectReader.java
        tallison@mitre.org Tim Allison added a comment -

        Updated AutoDetectReader. Thank you for opening this.

        Out of curiosity, if you can share any metrics on the decrease in memory consumption from this change, that'd be great! Did you use hprof or another memory profiling tool to see a difference between before and after this change?

        tallison@mitre.org Tim Allison added a comment -

        Just wanted confirmation. I'll fix those, probably early next week. Thank you!

        joshbooks Joshua Hight added a comment - - edited

        Tim Allison yep, it looks like AutoDetectReader and MidiParser are the only two other places doing this sort of thing. Shall I submit another PR? If so, would you prefer I create another bug or reference this one?

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1185 (See https://builds.apache.org/job/Tika-trunk/1185/)
        TIKA-2244 – be more parsimonious with BufferedInputStream. This closes (tallison: rev 4cc15e2a3f813dce8001c1eb4aae712b05c557d4)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/audio/MidiParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
        • (edit) CHANGES.txt
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #204 (See https://builds.apache.org/job/tika-2.x/204/)
        TIKA-2244 – be more parsimonious with BufferedInputStream via Josh (tallison: rev 78828176a5da851c555f09688075983dfe4d8e49)

        • (edit) tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java
        • (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
        • (edit) CHANGES.txt
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/audio/MidiParser.java
        • (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java
        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
        tallison@mitre.org Tim Allison added a comment -

        PR committed. Thank you! I found a few other places to check markSupported before wrapping in a new BufferedInputStream in tika-parsers.

        Should we also update AutoDetectReader to check for markSupported before wrapping?


          People

          • Assignee: Unassigned
          • Reporter: joshbooks Joshua Hight