TIKA-591

Separate launcher process for forking JVMs

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      As a follow-up to TIKA-416, it would be good to implement at least optional support for a separate launcher process for the ForkParser feature. The need for such an extra process came up in JCR-2864, where a reference to http://developers.sun.com/solaris/articles/subprocess/subprocess.html was made.

      To summarize, the problem is that the ProcessBuilder.start() call can result in a temporary duplication of the memory space of the parent JVM. Even with copy-on-write semantics, this can be a fairly expensive operation and is prone to out-of-memory issues, especially in large-scale deployments where the parent JVM already uses most of the available RAM on the machine.
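
The idea of a separate launcher can be illustrated with a minimal sketch. It is not Tika's ForkParser code and not a proposed patch: the class name LauncherSketch and its methods are hypothetical, and the sketch assumes a POSIX system with `sh` on the PATH. The parent JVM starts one small shell early, while its own heap is still small, and later asks that shell to spawn child JVMs, so the later forks happen in the small launcher rather than in the large parent.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a launcher process; names are illustrative only. */
final class LauncherSketch {

    /** Builds the command line for a child JVM using the current java.home. */
    static List<String> childJvmCommand(String classpath, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add(System.getProperty("java.home") + "/bin/java");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        return cmd;
    }

    /**
     * Starts the launcher. Call this early, before the parent heap grows.
     * Assumption: a POSIX shell named "sh" is available on the PATH.
     */
    static Process startLauncher() throws IOException {
        return new ProcessBuilder("sh")
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
    }

    /**
     * Asks the launcher to start the command in the background.
     * No shell quoting is applied, so arguments must not contain
     * spaces or shell metacharacters in this sketch.
     */
    static void launch(Process launcher, List<String> cmd) throws IOException {
        String line = String.join(" ", cmd) + " &\n";
        OutputStream out = launcher.getOutputStream();
        out.write(line.getBytes(StandardCharsets.UTF_8));
        out.flush();
    }
}
```

A caller would invoke startLauncher() once at startup and then launch(launcher, childJvmCommand(...)) whenever a new child JVM is needed. The sketch leaves out how the parent would then talk to the child (for example over a socket or named pipe), which a real implementation would have to solve.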

      A similar problem is also being discussed at HADOOP-5059.

        Activity

        Jukka Zitting made changes -
        Field Original Value New Value
        Assignee Jukka Zitting [ jukkaz ]
        Tyler Palsulich added a comment -

        I bring up tika-batch (from Tim Allison) because it's meant to provide a way to reliably run Tika on a large collection of documents, killing the processing when Tika seems to be hanging indefinitely. But I'm not sure whether it runs in an entirely different JVM or just a different thread, or whether that even matters with regard to this issue.

        Luis Filipe Nassif added a comment -

        I think this is very important. We are having problems on Linux that I think are related to this while running the TesseractOCRParser. Sometimes the trace is similar to those posted in HADOOP-5059, and sometimes it is outside of TesseractOCRParser, but I think it is related to memory corruption caused by an early fork/exec. Reducing the max heap of the JVM helps a bit, but does not solve the issue. I don't know the tika-batch code. Is it possible to use CompositeParser directly with tika-batch?

        Tyler Palsulich added a comment -

        Is there still interest in this, or is it superseded by tika-batch?

        Jukka Zitting created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Jukka Zitting
          • Votes:
            0
            Watchers:
            4

            Dates

            • Created:
              Updated:
