Tika / TIKA-2972

Allow users to specify a list/map of ContentHandlerFactories in tika-config.xml

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      I'd like to add a tika-eval handler that calculates text statistics at the end of parsing a document, so that users get a unified, simpler view of the number of tokens, out-of-vocabulary tokens, etc. in the metadata rather than having to run their own post-parse process on the content.
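
      As a rough illustration of the kind of handler meant here, the sketch below buffers character events and writes a token count when the document ends. It is not tika-eval code: the class name `TextStatsHandler`, the key `stats:token-count`, and the plain `Map` standing in for Tika's metadata object are all assumptions made for the example, and whitespace splitting is a deliberately naive tokenizer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects text during parsing and records stats at end of document. */
public class TextStatsHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Stand-in for Tika's metadata object; the key name below is illustrative only.
    private final Map<String, String> metadata;

    public TextStatsHandler(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Accumulate every chunk of character data the parser emits.
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // Naive whitespace tokenization; a real implementation would use a proper analyzer.
        String trimmed = text.toString().trim();
        int tokens = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        metadata.put("stats:token-count", Integer.toString(tokens));
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new LinkedHashMap<>();
        TextStatsHandler handler = new TextStatsHandler(metadata);
        handler.startDocument();
        char[] chars = "the quick brown fox".toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
        System.out.println(metadata); // prints {stats:token-count=4}
    }
}
```

      The point of doing the work in `endDocument` is that the stats land in the metadata as part of the normal parse, with no separate post-processing step.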

      The problem is integrating this into tika-app and tika-server: with tika-eval bundled, tika-app balloons to 134MB. I don't want to nearly double the size of tika-app just to add something that very few folks will use.

      I think we've discussed this option before, but it would be handy to allow users to specify a ContentHandlerFactory or possibly a map of ContentHandlerFactories in tika-config.xml so that users can get custom handling in tika-app and tika-server.

      The idea behind a map of ContentHandlerFactories is to give each content handler factory a name, so that a user could call different handlers on tika-server like this:

      curl... http://localhost:9998/tika/custom/myhandler1
      curl... http://localhost:9998/tika/custom/myhandler2

      That's not right, though, because we'd want to differentiate between classic Tika parsing and the RecursiveParserWrapper...

      curl... http://localhost:9998/tika/myhandler1
      curl... http://localhost:9998/tika/myhandler2

      curl... http://localhost:9998/rmeta/myhandler1
      curl... http://localhost:9998/rmeta/myhandler2

      or in tika-app:

      java -jar tika-app.jar --handlerFactory=myhandler1...
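
      A minimal sketch of how the name-to-factory lookup might work, assuming the names have already been loaded from tika-config.xml. Everything here is hypothetical: `HandlerFactoryRegistry`, `nameFromPath`, and `nameFromArg` are invented for illustration, and `Supplier<ContentHandler>` stands in for Tika's ContentHandlerFactory so the example runs on the JDK alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: a named map of handler factories, as might be built from tika-config.xml. */
public class HandlerFactoryRegistry {
    // Supplier<ContentHandler> is a stand-in for a real ContentHandlerFactory.
    private final Map<String, Supplier<ContentHandler>> factories = new LinkedHashMap<>();

    public void register(String name, Supplier<ContentHandler> factory) {
        factories.put(name, factory);
    }

    /** Returns a fresh handler for the given name, or throws if the name was never configured. */
    public ContentHandler newHandler(String name) {
        Supplier<ContentHandler> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown handler factory: " + name);
        }
        return factory.get();
    }

    /** Extracts the factory name from a path such as "/tika/myhandler1" or "/rmeta/myhandler1". */
    public static String nameFromPath(String path) {
        // "/tika/myhandler1".split("/") yields ["", "tika", "myhandler1"]
        String[] parts = path.split("/");
        return parts.length == 3 && !parts[2].isEmpty() ? parts[2] : null;
    }

    /** Extracts the factory name from a tika-app style argument "--handlerFactory=NAME". */
    public static String nameFromArg(String arg) {
        String prefix = "--handlerFactory=";
        return arg.startsWith(prefix) ? arg.substring(prefix.length()) : null;
    }

    public static void main(String[] args) {
        HandlerFactoryRegistry registry = new HandlerFactoryRegistry();
        registry.register("myhandler1", DefaultHandler::new);

        String fromServer = nameFromPath("/rmeta/myhandler1");
        String fromApp = nameFromArg("--handlerFactory=myhandler1");
        System.out.println(fromServer.equals(fromApp));             // prints true
        System.out.println(registry.newHandler(fromServer) != null); // prints true
    }
}
```

      With this shape, the `/tika/...` and `/rmeta/...` endpoints and the tika-app flag would all resolve to the same named factory; only the surrounding parse mode would differ.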

      WDYT?

              People

              • Assignee:
                Unassigned
                Reporter:
                tallison Tim Allison
              • Votes:
                0
                Watchers:
                2
