SOLR-11721: Isolate most of Tika and dependencies into separate JVM

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Tika should not be run in the same JVM as Solr. Ever.

      Upgrading Tika while getting all of the dependencies right manually, and hoping to avoid jar hell, is, um, error-prone. See my recent failure, SOLR-11622, for which I apologize profusely.

      Running DIH against Tika's unit test documents has been eye-opening. It has revealed some other version conflict/dependency failures that should have been caught much earlier.

      The fix is non-trivial, but we should work towards it.
      I see two options:

      1. TIKA-2514 – Tika's current ForkParser offers a model for a minimal fork process + server option. Its limitation today is that all parsers and dependencies must be serializable, which can be a problem for users who add their own parsers with dependencies that were not designed for serializability. The proposal there is to rework the ForkParser to use a TIKA_HOME directory for all dependencies.

      2. SOLR-7632 – use tika-server, but make it seamless and as easy (and secure!) to use as the current handlers.
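
      To make option 2 concrete, here is a minimal sketch of what a thin client to an external tika-server could look like from the Solr side. It is not a proposed implementation. The class name TikaServerClient, the base URL, and the timeout value are hypothetical; the sketch assumes a tika-server instance is already running (9998 is its default port) and uses its /tika endpoint, which accepts a document via PUT and returns extracted text when text/plain is requested.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: sends raw document bytes to a separately running
// tika-server so that no Tika parser code is loaded into the calling JVM.
final class TikaServerClient {
    private final URI endpoint;
    private final Duration timeout;
    private final HttpClient http;

    // baseUrl is an assumption, e.g. "http://localhost:9998".
    TikaServerClient(String baseUrl, Duration timeout) {
        this.endpoint = URI.create(baseUrl).resolve("/tika");
        this.timeout = timeout;
        this.http = HttpClient.newBuilder().connectTimeout(timeout).build();
    }

    // Builds the PUT request without sending it, which keeps it easy to inspect.
    HttpRequest buildRequest(byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .timeout(timeout)                  // bound the wait on a slow or hung parse
                .header("Accept", "text/plain")    // ask for plain extracted text
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the document and returns the extracted text.
    // A parser crash or hang on the server side surfaces here as an
    // exception or a non-200 status instead of taking down the caller.
    String extractText(byte[] document) throws IOException, InterruptedException {
        HttpResponse<String> response =
                http.send(buildRequest(document), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

      The request timeout is the main point of the design: a runaway parse becomes a failed HTTP call rather than a stuck thread or OutOfMemoryError inside Solr. Securing the endpoint, which the issue calls out, is not addressed by this sketch.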

      Other thoughts, recommendations?

              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison@apache.org)
              • Votes: 0
              • Watchers: 6