Tika / TIKA-686

Split tika-parsers into separate components

    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      The email thread [1] from two years ago that led to splitting Tika into separate components also suggested splitting tika-parsers into separate components based on dependencies. This would be extremely useful, especially in cases where a given parser has no dependencies beyond tika-core. Please consider refactoring the parsers into separate components for 1.0.

      [1] http://markmail.org/message/tavirkqhn6r2szrz

          Activity

          Ken Krugler added a comment -

          I'm in favor of anything that helps with avoiding dependencies on POI, if all I want to parse are text-ish formats

          I assume we could still have a tika-parsers that has all of the parsers, which just has dependencies on all of the tika-parser-xxx components.

          Note that there's still the issue of some of Tika's functionality gracefully handling missing components. IIRC, some of Tika's configuration is still driven primarily by data, versus some combination of data plus what's available at run time.

          Nick Burch added a comment -

          I'd personally not be in favour of having lots of Tika parser jars - I think it would make things much more complicated, and lead to confusion when people accidentally missed one out

          Instead, is it not better to have parsers log but then bow out when they can't find their dependencies? That way, if you don't want to parse the Microsoft Office formats, you ditch the POI dependencies, keep the standard Tika parser jar, ignore the warning and you're away.
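To make the "log but then bow out" idea concrete, here is a minimal sketch. It assumes each parser can name one marker class from the library it needs. `DependencyProbe`, `usableParsers` and the parser names are hypothetical and are not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical helper, not part of Tika: decides which parsers to enable
// based on whether a marker class from each parser's library can be found.
final class DependencyProbe {
    private static final Logger LOG = Logger.getLogger(DependencyProbe.class.getName());

    private DependencyProbe() {}

    // Returns true if the named class can be located, without initializing it.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, DependencyProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Takes parser name -> marker class name and returns the parser names
    // whose marker class is present. Missing ones are logged and skipped.
    static List<String> usableParsers(Map<String, String> markerClassByParser) {
        List<String> usable = new ArrayList<>();
        for (Map.Entry<String, String> entry : markerClassByParser.entrySet()) {
            if (isAvailable(entry.getValue())) {
                usable.add(entry.getKey());
            } else {
                LOG.warning("Skipping " + entry.getKey() + ": "
                        + entry.getValue() + " is not on the classpath");
            }
        }
        return usable;
    }
}
```

With this shape, excluding a library costs the user one warning per affected parser rather than a failure at startup.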

          Jukka Zitting added a comment -

          We already did quite a bit of work towards making Tika degrade gracefully when some dependencies are not present, so for now I'd rather encourage people to exclude those dependencies they don't want instead of having to deal with an explosion of dependencies.

          My original idea for the Parser interface was that upstream parser libraries could actually implement the interface directly, so that we wouldn't even need any code in tika-parsers. So far we haven't done that too much because the Parser interface was still evolving, but with the AbstractParser class and the proposed cleanup of the Parser interface in 1.0 we should be in a good position to start pushing the Parser implementations upstream.

          For example with POI we could push the entire o.a.tika.parsers.microsoft package up to be maintained and included inside POI as something like o.a.poi.tika, either inside one of the existing POI jars (with tika-core as an optional dependency) or as a separate poi-tika jar. Then people could get MS Office support with dependencies to nothing but tika-core and POI. The tika-parsers component would still exist as a composite that mostly just brings together all known Apache-compatible parser implementations.

          Ken Krugler added a comment -

          @Nick - my thought was that we'd have a tika-parsers that had dependencies on all of the parsers, so if you want them all you'd just have to have a dependency on that.

          This would be similar to what Jukka talked about, where tika-parsers is a composite that brings all of the individual parsers together.

          Though if you're not using a dependency management system, that would make things harder.

          @Jukka - what are your concerns about "an explosion of dependencies", if that were the case?

          @Jukka - What is your assessment of the current state of affairs in Tika, for gracefully handling missing dependencies? I haven't tracked recent changes, but I thought that we'd run into a new cause of failure when a required library was excluded.

          Christopher Currie added a comment -

          I admit up front I'm biased toward the dependency management case. From my perspective it's a pain to have to dig into the dependencies and exclude all the ones I don't want.

          In the end, I think the key question is "what's the common case?" Is it more common to need a lot of parsers, or just one or two? If it's the former, I think keeping a single jar makes a lot of sense. If it's one or two, then I think having separate jars makes things better, because end-users have a clear path: only care about AutoCAD? Take the DWGParser jar and you're done.

          Alternatively, there are other Maven-level options that could be considered that would be an improvement on the current state:

          1. Make all of the dependencies of tika-parsers 'optional', except for tika-core. This more closely matches the non-dependency-managed scenario, where the end user is responsible for making sure he or she has all the required dependencies for the parser in question.

          2. Create pom-only modules for each parser that pre-document the dependency filter. In other words, for each parser 'foo', create a tika-parser-foo pom that depends on tika-parsers but excludes the dependencies that are not needed by that parser. This saves each end user from the work of figuring out the exclusion list by themselves.

          Since I'm making the request, I'm happy to volunteer myself for some of the grunt-work for any of these solutions, if resources are needed to get them done.

          Antoni Mylka added a comment -

          FWIW I would say that fewer is better.

          We (Aperture) tried it and overdid this. Long story short: version 1.4 was split into 73 modules, with 31 external dependencies, builds took forever and day-to-day development work was a pain. It was madness. Clearly, with a bit more common sense it might have worked out better, but the key issue was that nobody wanted this and everyone used a special 'onejar' assembly anyway.

          I don't like optional dependencies. I need lots of XML in my pom to make my app work.

          I personally like exclusions better. It's just necessary to make sure that

          {{<dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <exclusions>
              <exclusion>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi</artifactId>
              </exclusion>
              <exclusion>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi-scratchpad</artifactId>
              </exclusion>
              <exclusion>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi-ooxml</artifactId>
              </exclusion>
            </exclusions>
          </dependency>}}

          ... works without ClassNotFoundErrors. (Aperture throws them in such a case right now).

          A solution with pom-only modules for each parser is OK as long as the default case is left as it is. The same problem will have to be solved, though. If I only want office with poi, then the Tika facade must not initialize the PdfParser when the class itself is present on the classpath but its dependencies aren't.

          Nick Burch added a comment -

          Does anyone know of a good resource for how imports, method signatures, includes etc affect when a missing dependency will trigger a problem?

          It's all very well having the Parser constructor try a Class.forName and throwing a DependencyMissingException or similar, but if we've done something that means the Parser blows up with a ClassNotFound before the constructor then that's no help...
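One way to sidestep the timing question is to stop relying on the constructor running at all and guard the instantiation itself. A missing dependency usually surfaces as a NoClassDefFoundError, which is a LinkageError and not an Exception. When it is thrown depends on when the JVM resolves the reference. The sketch below is an assumption about how a loader could cope, not Tika's actual code, and `SafeInstantiator` is a hypothetical name.

```java
import java.util.Optional;

// Hypothetical helper, not Tika's loader: creates an instance by class name and
// treats any class-loading or linkage failure as "this parser is unavailable".
final class SafeInstantiator {
    private SafeInstantiator() {}

    static <T> Optional<T> tryCreate(String className, Class<T> type) {
        try {
            Class<?> raw = Class.forName(className);
            return Optional.of(type.cast(raw.getDeclaredConstructor().newInstance()));
        } catch (ReflectiveOperationException e) {
            // The class itself is missing, has no accessible no-arg constructor,
            // or its constructor threw an exception.
            return Optional.empty();
        } catch (LinkageError e) {
            // For example NoClassDefFoundError: the class was found, but something
            // it refers to could not be loaded, possibly before the constructor ran.
            return Optional.empty();
        } catch (ClassCastException e) {
            // The class loaded but is not of the requested type.
            return Optional.empty();
        }
    }
}
```

Catching LinkageError at the point of creation covers failures during loading and static initialization, but not a reference that is only resolved later, for instance inside a parse method.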

          Jukka Zitting added a comment -

          See PDFBOX-1132 for an example of how we could/should start pushing our parser classes to upstream projects.

          Antoni Mylka added a comment -

          Why keep this issue open?

          PdfParser appeared in PdfBox (PDFBOX-1132). Keeping both hardly makes sense and has already been identified as a problem (TIKA-810). Pushing parsers upstream covers the "I'm in favor of anything that helps with avoiding dependencies on POI" use case of Ken. We agree that we keep the dependency from tika-parsers to POI (doubts about that dispelled in http://mail-archives.apache.org/mod_mbox/tika-dev/201112.mbox/%3C4EEBA9CA.9030900%40gmail.com%3E). With this dependency, it will be possible to use the maven exclusion construct, exactly as described in my "I like exclusions better" post. So all known use cases are covered.

          Since we can't actually remove the PdfParser from Tika now (as that would definitely be a backward-incompatible change), we should deprecate it, remove it from the /META-INF/services/org.apache.tika.parser.Parser and replace the implementation with a delegation to the pdfbox version, but that would fall within the scope of TIKA-810.
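The deprecate-and-delegate step could look roughly like the sketch below. All names here are hypothetical stand-ins: `SimpleParser` is a one-method placeholder (Tika's real Parser interface is different), and `UpstreamPdfParser` stands for whatever class the upstream project ships.

```java
// Hypothetical stand-in for a parser interface; Tika's real Parser interface differs.
interface SimpleParser {
    String parse(String input);
}

// Stand-in for the implementation that now lives in the upstream project.
final class UpstreamPdfParser implements SimpleParser {
    @Override
    public String parse(String input) {
        return "upstream:" + input;
    }
}

/**
 * Old class kept so existing callers still compile and run.
 *
 * @deprecated use the upstream parser directly; this class only delegates.
 */
@Deprecated
final class LegacyPdfParser implements SimpleParser {
    private final SimpleParser delegate = new UpstreamPdfParser();

    @Override
    public String parse(String input) {
        // No parsing logic of its own: forward everything to the upstream version.
        return delegate.parse(input);
    }
}
```

Existing code that constructs the old class keeps working, while the parsing logic lives in one place only.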

          Anyway, this can be closed. The discussion can continue in TIKA-810 and in some new issue for POI.

          WDYT?

          Jukka Zitting added a comment -

          Resolving as Won't Fix as there's no clear consensus on how to proceed. Let's use the mailing list to discuss this and come back to the issue tracker only once there's a concrete plan of action.


            People

            • Assignee:
              Unassigned
              Reporter:
              Christopher Currie
            • Votes:
              1
              Watchers:
              2
