SOLR-11721: Isolate most of Tika and dependencies into separate JVM


Details

    Description

      Tika should not be run in the same JVM as Solr. Ever.

      Upgrading Tika and hoping to avoid jar hell while getting all of the dependencies right manually is, um, error-prone. See my recent failure, SOLR-11622, for which I apologize profusely.

      Running DIH against Tika's unit test documents has been eye-opening. It has revealed some other version conflict/dependency failures that should have been caught much earlier.

      The fix is non-trivial, but we should work towards it.
      I see two options:

      1. TIKA-2514 – Our current ForkParser offers a model for a minimal fork process + server option. The limitation currently is that all parsers and dependencies must be serializable, which can be a problem for users adding their own parsers with deps that might not be designed for serializability. The proposal there is to rework the ForkParser to use a TIKA_HOME directory for all dependencies.

      2. SOLR-7632 – use tika-server, but make it seamless and as easy (and secure!) to use as the current handlers.
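      To make option 2 a bit more concrete, here is a minimal sketch of what the Solr side of a tika-server call could look like, using only the JDK's built-in HTTP client (Java 11+). It assumes a tika-server instance is already running as a separate process on localhost:9998 and that extraction goes through a PUT to its /tika endpoint with an Accept: text/plain header. The class name, method names, and the 60-second timeout are hypothetical, chosen for illustration only; nothing here is taken from SOLR-7632 or from existing Solr code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical client sketch: parsing happens in the tika-server JVM, so a
// parser crash or dependency conflict there cannot take Solr down with it.
public class TikaServerClientSketch {

    // Assumption: tika-server listens on localhost:9998.
    public static final URI DEFAULT_BASE = URI.create("http://localhost:9998");

    /** Builds a PUT request to the /tika endpoint asking for plain-text output. */
    public static HttpRequest buildExtractRequest(URI base, byte[] document, Duration timeout) {
        return HttpRequest.newBuilder(base.resolve("/tika"))
                .timeout(timeout) // bound how long a hung parse can block the caller
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the document to the server and returns the extracted text. */
    public static String extractText(HttpClient client, URI base, byte[] document)
            throws Exception {
        HttpRequest request = buildExtractRequest(base, document, Duration.ofSeconds(60));
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException(
                    "tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TikaServerClientSketch.java /path/to/document
        byte[] bytes = Files.readAllBytes(Path.of(args[0]));
        System.out.println(extractText(HttpClient.newHttpClient(), DEFAULT_BASE, bytes));
    }
}
```

      The request timeout is the important design choice in this sketch: a document that hangs the parser costs the caller one failed request instead of a stuck Solr thread. It says nothing about the "seamless" and "secure" parts of option 2, such as starting and supervising the server or restricting who can reach it.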

      Other thoughts, recommendations?

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)