Tika / TIKA-317

Service provider -based Tika configuration

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7
    • Component/s: parser
    • Labels:
      None

      Description

      I'd like to simplify Tika configuration and make it easier to customize by pushing the information in tika-config.xml to Parser annotations and Java SPI service files.

      1. TIKA-317.patch
        59 kB
        Jukka Zitting

        Activity

        Chris A. Mattmann added a comment -

        Hey Jukka: could you explain how this will be simpler? I, personally, like the tika-config.xml file. Details, please.

        Benson Margulies added a comment -

        I'm with Jukka. I needed to replace one processor. Having to copy and modify the XML file, and then forever maintain my mutant version as new Tika releases change the rest of the contents that I don't want to change, is not a good prospect.

        Jukka Zitting added a comment -

        As Benson mentioned, a pretty typical deployment scenario is one where you want to extend Tika with a few custom Parser classes. Currently you'd either need to maintain a custom version of the full configuration file, or do some CompositeParser magic to inject your custom parsers at runtime. Neither option is ideal.

        Another concern of mine is that the current configuration mechanism disconnects the list of supported media types from the parser implementation class. It would be better if that list were maintained in the same Java source file instead of in the XML configuration.

        Thinking further, there's some interest in making Tika easy to use in more dynamic environments like an OSGi container where new parser components may be added to or removed from the system at any time. A static configuration file does not work that well in such situations.

        So my idea is to move the list of media types supported by a Parser class to a class annotation (or perhaps a getSupportedTypes() method that would work better with composite parsers) and replace the tika-config.xml file with a META-INF/services/org.apache.tika.parser.Parser file that simply lists all the Parser implementations within that jar file.
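
The proposal above can be illustrated with a minimal sketch in which each parser reports its own media types and a registry builds the type-to-parser index from that information. All names here (SketchParser, SketchTextParser, SketchImageParser, SketchRegistry) are hypothetical stand-ins and media types are plain strings; this is not the Tika API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parser interface; not the real Tika Parser.
interface SketchParser {
    Set<String> getSupportedTypes();
}

// The supported types live next to the implementation, not in an XML file.
class SketchTextParser implements SketchParser {
    public Set<String> getSupportedTypes() {
        return Collections.singleton("text/plain");
    }
}

class SketchImageParser implements SketchParser {
    public Set<String> getSupportedTypes() {
        return Collections.unmodifiableSet(
                new LinkedHashSet<String>(Arrays.asList("image/png", "image/gif")));
    }
}

class SketchRegistry {
    // Builds a media type -> parser index from whatever parsers are available.
    static Map<String, SketchParser> index(List<? extends SketchParser> parsers) {
        Map<String, SketchParser> byType = new HashMap<String, SketchParser>();
        for (SketchParser parser : parsers) {
            for (String type : parser.getSupportedTypes()) {
                byType.put(type, parser); // a later parser replaces an earlier one
            }
        }
        return byType;
    }
}
```

Because the index is derived from the parsers themselves, the same type lists can serve any mechanism that supplies the parser instances.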

        Chris A. Mattmann added a comment -

        Thanks for the additional detail, Jukka, but I fail to see how co-locating metadata with code (as in the case of JDK annotations) is any better a mechanism than separating out such configuration into an XML file. Also, what is the difference between having the information in the tika-config.xml file versus locating (some of) that information in a META-INF/services/o.a.tika.parser.Parser file? I guess I just need to understand more, because I'm missing something.

        Jukka Zitting added a comment -

        Re: co-locating metadata with code: doing so makes it easier to support multiple different configuration mechanisms (default Tika config, programmatic configuration, OSGi services, IoC containers, etc.), as you don't need to duplicate the media type lists for each different way of configuring things.

        Re: tika-config.xml vs. META-INF/services/...: the service provider mechanism [1] makes it easy to add custom parser implementations without having to maintain a separate copy of the full Tika configuration file. You could, for example, create a my-custom-parsers.jar file with a META-INF/services/o.a.tika.parser.Parser file that lists only your custom parser classes. When you add that jar to the classpath, Tika would then automatically pick up those parsers in addition to the standard parser classes from the tika-parsers jar.

        [1] http://java.sun.com/j2se/1.5.0/docs/guide/jar/jar.html#Service Provider
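
For readers unfamiliar with the service provider mechanism, a provider-configuration file is a plain text file with one fully qualified class name per line, where `#` starts a comment and blank lines are ignored. The sketch below reads that format by hand, purely for illustration. The class ProviderFileSketch and the com.example names are hypothetical; on Java 6 and later, java.util.ServiceLoader performs this lookup (and the instantiation) for you.

```java
import java.util.ArrayList;
import java.util.List;

class ProviderFileSketch {
    // Extracts provider class names from the text of a
    // META-INF/services/<interface name> file.
    static List<String> readProviderNames(String fileContent) {
        List<String> names = new ArrayList<String>();
        for (String line : fileContent.split("\r?\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            line = line.trim();
            if (line.length() > 0 && !names.contains(line)) {
                names.add(line); // duplicates are listed only once
            }
        }
        return names;
    }
}
```

Each jar on the classpath can carry its own file of this kind, which is what lets a separate extension jar contribute parsers without touching the standard list.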

        Jukka Zitting added a comment -

        Postponing to after 0.5

        Jukka Zitting added a comment -

        The attached patch introduces the following new Parser method:

        /**
         * Returns the set of media types supported by this parser when used
         * with the given parse context.
         *
         * @since Apache Tika 0.7
         * @param context parse context
         * @return immutable set of media types
         */
        Set<MediaType> getSupportedTypes(ParseContext context);

        An explicit method is better than static annotations since it allows parsers to adapt to situations where optional functionality, such as certain parser libraries, is not available. This approach also works for things like parser compositions and decorations.
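
As a rough illustration of that point, a parser could report its media types only when the library it depends on can actually be loaded. This is a sketch under assumptions: the class OptionalLibraryParserSketch, the media type string and the class-name probe are all hypothetical, and the real method takes a ParseContext and returns Set<MediaType>.

```java
import java.util.Collections;
import java.util.Set;

class OptionalLibraryParserSketch {
    // Fully qualified name of a class from the optional library (hypothetical).
    private final String requiredClass;

    OptionalLibraryParserSketch(String requiredClass) {
        this.requiredClass = requiredClass;
    }

    // Probes for the optional dependency without initializing it.
    private boolean libraryAvailable() {
        try {
            Class.forName(requiredClass, false,
                    OptionalLibraryParserSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Claims a media type only if the backing library is present.
    Set<String> getSupportedTypes() {
        if (libraryAvailable()) {
            return Collections.singleton("application/x-example");
        }
        return Collections.emptySet();
    }
}
```

A static annotation could not express this, because its value is fixed at compile time.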

        The patch modifies the configuration mechanism so that the getSupportedTypes() method is used whenever a <parser/> entry without embedded <mime/> elements is encountered. This should maintain reasonable backwards compatibility with existing config files until Tika 1.0.
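
The fallback rule described here can be sketched as follows. The patch itself is not reproduced in this thread, so this is only an assumption about the shape of the logic; SelfDescribingParser and ConfigEntrySketch are hypothetical names and media types are plain strings.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a parser that can report its own media types.
interface SelfDescribingParser {
    Set<String> getSupportedTypes();
}

class ConfigEntrySketch {
    // configuredMimes holds the <mime/> values found inside one <parser/> entry.
    static Set<String> resolveTypes(List<String> configuredMimes,
                                    SelfDescribingParser parser) {
        if (configuredMimes == null || configuredMimes.isEmpty()) {
            // No explicit <mime/> elements: ask the parser itself.
            return parser.getSupportedTypes();
        }
        // Explicit <mime/> elements win, so older config files keep working.
        return Collections.unmodifiableSet(new LinkedHashSet<String>(configuredMimes));
    }
}
```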

        Jukka Zitting added a comment -

        Updated issue topic from "Annotation-based ..." to "Service provider -based ...". This better matches the approach I've implemented.

        Jukka Zitting added a comment -

        I committed the proposed patch (revision 911195) and followed up with a change that makes the default Tika configuration use the Java service provider mechanism to find all available Parser classes (revision 911225).

        With this change you'll no longer need to maintain a custom copy of tika-config.xml if you want to extend Tika with your own parser classes. Instead you can just list your parser classes in a META-INF/services/org.apache.tika.parser.Parser file inside the jar that contains your extensions.
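
For illustration, an extension jar following this description might be laid out as shown below. The jar name is taken from the earlier example in this thread; the class name com.example.FooParser is hypothetical.

```text
my-custom-parsers.jar
    com/example/FooParser.class
    META-INF/services/org.apache.tika.parser.Parser

# contents of META-INF/services/org.apache.tika.parser.Parser
com.example.FooParser
```

The listed class would need to implement org.apache.tika.parser.Parser so that it can be discovered and instantiated.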


          People

          • Assignee: Jukka Zitting
          • Reporter: Jukka Zitting
          • Votes: 0
          • Watchers: 1
