Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels: None

      Description

      There's currently no easy way to guard against JVM crashes, or against excessive memory or CPU use, caused by parsing very large, broken, or intentionally malicious input documents. To better protect against such cases, and to generally improve the manageability of Tika's resource consumption, it would be great if we had a way to run Tika parsers in separate JVM processes. This could be handled either as a separate "Tika parser daemon" or as an explicitly managed pool of forked JVMs.

        Activity

        chrismattmann Chris A. Mattmann added a comment -

        Awesome job Jukka!

        jukkaz Jukka Zitting added a comment - edited

        An initial version of this feature is now working and included in the latest trunk.

        To illustrate the improvement, here's what I'm seeing, for example, with one fairly large Excel document:

        $ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar large.xls
        Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:69)
        at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:55)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:157)
        at org.apache.tika.detect.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:145)
        at org.apache.tika.detect.POIFSContainerDetector.detect(POIFSContainerDetector.java:96)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:126)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

        The OutOfMemoryError is really troublesome in many container environments where hitting the memory limit affects all active threads, not just the one using Tika.

        With the new out-of-process parsing feature, it's possible to externalize this problem into a separate background process:

        $ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar --fork large.xls
        Exception in thread "main" java.io.IOException: Lost connection to a forked server process
        at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:149)
        at org.apache.tika.fork.ForkClient.call(ForkClient.java:84)
        at org.apache.tika.fork.ForkParser.parse(ForkParser.java:78)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

        Such normal exceptions are much easier to recover from.
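        From client code, the same isolation is available through the new ForkParser class visible in the stack trace above. The following is only a sketch of how it might be wired up (the file name is illustrative, and constructor and method signatures may differ slightly in trunk); it cannot run without Tika on the classpath:

        ```java
        import java.io.FileInputStream;
        import java.io.InputStream;

        import org.apache.tika.fork.ForkParser;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.sax.BodyContentHandler;

        public class ForkParseExample {
            public static void main(String[] args) throws Exception {
                // The actual parsing runs in a forked JVM; if the child
                // crashes or runs out of memory, the parent only sees a
                // normal IOException from parse().
                ForkParser parser = new ForkParser(
                        ForkParseExample.class.getClassLoader(),
                        new AutoDetectParser());
                InputStream in = new FileInputStream("large.xls");
                try {
                    BodyContentHandler handler = new BodyContentHandler();
                    parser.parse(in, handler, new Metadata(), new ParseContext());
                    System.out.println(handler.toString());
                } finally {
                    in.close();
                    parser.close(); // shut down the forked server process(es)
                }
            }
        }
        ```

        The parent JVM stays small regardless of what the document does to the child's heap, which is the whole point of the feature.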

        jukkaz Jukka Zitting added a comment -

        See http://jukkaz.wordpress.com/2010/05/27/forking-a-jvm/ for a summary of my current approach on how to achieve this.
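        The core trick described there is launching a child JVM with ProcessBuilder, reusing the java binary of the running JVM, and talking to it over its stdio pipes. A minimal JDK-only sketch of the launch step (the -Xmx value is illustrative):

        ```java
        import java.io.File;

        public class ForkJvm {
            public static void main(String[] args) throws Exception {
                // Locate the java binary of the currently running JVM.
                String java = System.getProperty("java.home")
                        + File.separator + "bin" + File.separator + "java";

                // Start a child JVM with its own (small) heap; stderr is
                // merged into stdout so the parent reads one pipe.
                Process child = new ProcessBuilder(java, "-Xmx32m", "-version")
                        .redirectErrorStream(true)
                        .start();

                int exit = child.waitFor();
                System.out.println("child JVM exited with " + exit);
            }
        }
        ```

        A real fork-parser would replace "-version" with a classpath and a server main class, and keep the child's stdin/stdout open as the command channel; losing that pipe is what surfaces as the "Lost connection to a forked server process" IOException.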

        chrismattmann Chris A. Mattmann added a comment -

        +1, this sounds like a great idea!

        We did some work on this in OODT in terms of simple external met extractors and so forth. Maybe we could follow a similar approach here. Check out:

        http://svn.apache.org/repos/asf/incubator/oodt/cas-metadata/trunk/src/main/java/gov/nasa/jpl/oodt/cas/metadata/extractors/ExternMetExtractor.java

        and

        http://svn.apache.org/repos/asf/incubator/oodt/cas-metadata/trunk/src/main/resources/examples/extern-config.xml

        as some examples of how to deal with this (NOTE, in OODT-3, we are still in the process of converting over the licenses and there are no "official" incubator releases of OODT yet, but I just wanted to let you know about it as some pointers to ways to get this done). You rock and I can't wait for this feature!


          People

          • Assignee:
            jukkaz Jukka Zitting
          • Reporter:
            jukkaz Jukka Zitting
          • Votes: 0
          • Watchers: 0
