Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: indexing, jackrabbit-core
    • Labels:
      None

      Description

      The upcoming Tika 0.9 release will contain a highly useful out-of-process text extraction feature (TIKA-416) that we should also use in Jackrabbit.

        Activity

        Jukka Zitting added a comment -

        I've implemented this by adding a new "forkJavaCommand" configuration option. When specified, this command will be used to fork background text extraction processes up to the configured "extractorPoolSize" limit (by default 2 x processor count). The pool size limit will only become effective once we upgrade to Tika 1.0 where TIKA-639 is implemented.

        The fork memory use issue is best handled within the Tika issue.
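
As a rough illustration of the behaviour described in this comment, the following Java sketch shows one way a configured command string and a pool size limit could bound the number of concurrently forked processes. It is not the Jackrabbit or Tika implementation: the `ForkLimiter` class, its methods, and the whitespace splitting of the command are assumptions made for this example. Only the option names "forkJavaCommand" and "extractorPoolSize" and the 2 x processor count default come from the comment.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Semaphore;

/** Hypothetical helper for illustration only; not part of Jackrabbit or Tika. */
class ForkLimiter {
    private final List<String> javaCommand;
    private final Semaphore permits;

    /**
     * @param forkJavaCommand command used to start a child JVM, e.g. "nice java -Xmx32m"
     * @param poolSize        maximum number of child processes running at once
     */
    ForkLimiter(String forkJavaCommand, int poolSize) {
        // Assumption: arguments are separated by whitespace and contain no quoting.
        this.javaCommand = Arrays.asList(forkJavaCommand.trim().split("\\s+"));
        this.permits = new Semaphore(poolSize);
    }

    /** The default mentioned in the comment: 2 x processor count. */
    static int defaultPoolSize() {
        return 2 * Runtime.getRuntime().availableProcessors();
    }

    /** Builds the full command line: configured command plus extra arguments. */
    List<String> commandLine(String... extraArgs) {
        List<String> command = new ArrayList<>(javaCommand);
        command.addAll(Arrays.asList(extraArgs));
        return command;
    }

    /** Forks a child process, blocking while the pool size limit is reached. */
    int run(String... extraArgs) throws IOException, InterruptedException {
        permits.acquire();
        try {
            Process process = new ProcessBuilder(commandLine(extraArgs)).inheritIO().start();
            return process.waitFor();
        } finally {
            permits.release();
        }
    }
}
```

For example, `new ForkLimiter("java -Xmx32m", ForkLimiter.defaultPoolSize()).run("-version")` would fork one short-lived child JVM. A real extraction pool would more likely keep child processes alive and reuse them rather than fork one per request.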

        Jukka Zitting added a comment -

        I filed TIKA-591 for tracking this within Tika. See also HADOOP-5059 for a similar problem in the Hadoop land.

        Sébastien Launay added a comment -

        Yes, it definitely needs to be handled in Tika, for the benefit of other projects using it.
        We ran into this issue in a webapp (one Tomcat instance with one webapp consuming 60% of RAM+swap, with -Xmx == -Xms) that used ProcessBuilder to launch a shell script for custom data processing from the JVM.
        I guess this can become problematic on an enterprise server with multiple webapps (one or several using JR/Tika).

        Jukka Zitting added a comment -

        Hmm, good point about subprocess creation. That should probably be handled in Tika, though I'm not sure where the proper tradeoff between complexity and fail-safety lies. Can we somehow estimate how likely the described fork() memory issues would be in Jackrabbit deployments?

        Sébastien Launay added a comment -

        I like this isolation because we can also kill parsers stuck in an infinite loop.
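
To make the benefit concrete, here is a minimal Java sketch of killing a child process that does not finish in time. It uses only the standard `ProcessBuilder` and `Process` APIs (Java 8 or later). The `TimedSubprocess` class and `runWithTimeout` method are hypothetical names for this example and do not reflect how Tika supervises its forked parsers.

```java
import java.io.IOException;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only. */
class TimedSubprocess {
    /**
     * Runs the command and waits up to the given timeout.
     *
     * @return the exit code, or an empty value if the process had to be killed
     */
    static OptionalInt runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        // inheritIO() avoids the child blocking on a full output pipe.
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (process.waitFor(timeout, unit)) {
            return OptionalInt.of(process.exitValue());
        }
        // The child is still running (e.g. a parser stuck in a loop): kill it.
        process.destroyForcibly();
        process.waitFor(); // wait until the operating system has reaped the child
        return OptionalInt.empty();
    }
}
```

A stuck parser running inside the main JVM could not be stopped this safely, which is the point made about isolation.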

        However, there might be an issue, at least on GNU/Linux systems (vanilla kernel), because of how processes are created.
        Creating a process (fork() followed by exec()) requires, at fork() time, as much free memory as the parent process already consumes.
        This can be a big issue on enterprise servers where a lot of memory is dedicated to the JVM: creating subprocesses can cause the system to swap or fail with errors like "Cannot run program "java": java.io.IOException: error=12, Cannot allocate memory".

        An often recommended solution is to start one lightweight helper process alongside the main one and to communicate with it (e.g. over a socket) whenever a subprocess needs to be created.

        More information can be found here:
        http://developers.sun.com/solaris/articles/subprocess/subprocess.html
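
The helper-process approach mentioned in this comment can be sketched in Java as follows. This is only an illustration under stated assumptions: the `ForkLauncher` class, its line-based protocol (one argument per line, terminated by an empty line, answered with `exit <code>` or `error <message>`), and the choice of a loopback TCP socket are all invented for this example. The sketch has no authentication and handles one request at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical launcher; run it as its own small JVM next to the main one. */
class ForkLauncher {
    /** Accepts requests until the server socket is closed or an I/O error occurs. */
    static void serve(ServerSocket server) throws IOException {
        while (true) {
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                // Read one argument per line until an empty line ends the request.
                List<String> command = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null && !line.isEmpty()) {
                    command.add(line);
                }
                out.println(run(command));
            }
        }
    }

    /** Forks from this small process, so little memory has to be duplicated. */
    private static String run(List<String> command) {
        if (command.isEmpty()) {
            return "error empty command";
        }
        try {
            Process process = new ProcessBuilder(command).inheritIO().start();
            return "exit " + process.waitFor();
        } catch (IOException e) {
            return "error " + e.getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "error interrupted";
        }
    }

    /** Client side: called from the large JVM instead of forking directly. */
    static String request(int port, List<String> command) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            for (String arg : command) {
                out.println(arg);
            }
            out.println(); // empty line marks the end of the command
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = Integer.parseInt(args[0]);
        // Bind to the loopback interface only, since the protocol is unauthenticated.
        try (ServerSocket server = new ServerSocket(port, 50, InetAddress.getLoopbackAddress())) {
            serve(server);
        }
    }
}
```

The launcher would be started once with a small heap, for example `java -Xmx16m ForkLauncher 9099` (the port is arbitrary), and the main JVM would then call `ForkLauncher.request(9099, command)` instead of using ProcessBuilder itself, so that fork() only ever duplicates the small helper process.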


          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Jukka Zitting
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
