TIKA-298

CompositeParser.getParser() should use mimetype hierarchy when falling back

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.8
    • Component/s: parser
    • Labels:
      None

      Description

      CompositeParser.getParser() doesn't consult supertypes when falling back: if it can't find a parser for the exact mimetype, it goes straight to the fallback parser.

      So, for example, if the file's mimetype is application/<whatever>+xml and no parser is registered for it, you get the default "do nothing" parser instead of the XML parser.
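
To make the difference concrete, here is a minimal, self-contained sketch that contrasts an exact-match lookup with one that walks up the type hierarchy. It does not use Tika's classes: the parser names are plain strings, and FallbackLookupSketch, exactOnly, withSupertypes and xmlSuffixSuperType are hypothetical names invented for this illustration. The only supertype rule assumed here is the +xml suffix convention.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class FallbackLookupSketch {

    /** Exact-match lookup: any type without its own parser gets the fallback. */
    static String exactOnly(Map<String, String> parsers, String type, String fallback) {
        return parsers.getOrDefault(type, fallback);
    }

    /** Tries the type, then each supertype in turn, before using the fallback. */
    static String withSupertypes(Map<String, String> parsers, String type,
                                 UnaryOperator<String> superTypeOf, String fallback) {
        for (String t = type; t != null; t = superTypeOf.apply(t)) {
            String parser = parsers.get(t);
            if (parser != null) {
                return parser;
            }
        }
        return fallback;
    }

    /** Hypothetical rule (an assumption of this sketch): only the +xml suffix has a supertype. */
    static String xmlSuffixSuperType(String type) {
        return type.endsWith("+xml") ? "application/xml" : null;
    }

    public static void main(String[] args) {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/xml", "XMLParser");

        String type = "application/example+xml";
        System.out.println(exactOnly(parsers, type, "EmptyParser"));
        System.out.println(withSupertypes(parsers, type,
                FallbackLookupSketch::xmlSuffixSuperType, "EmptyParser"));
    }
}
```

With only an application/xml parser registered, the first call returns the fallback and the second returns the XML parser, which is the behaviour this issue asks for.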

          Activity

          Ken Krugler added a comment -

          Jukka said on the mailing list:

          ========================================================
          Note that the MimeType.getSuperType() method already does some
          of this and we have related supertype settings stored in the
          tika-mimetypes.xml configuration. The type registry could also be told
          about the +xml convention and related implicit supertype settings like
          the ones encoded in the MediaType.isSpecializationOf() method.

          (Note that we currently have both MimeType and MediaType classes for
          similar purposes. This is due to an ongoing redesign of the mime type
          registry. For now it's probably best to work on the MimeType class
          until the redesign is more complete.)
          ========================================================

          Chris A. Mattmann added a comment -
          • set fix component
          Jukka Zitting added a comment -

          I implemented a simple version of this in revision 938966.

          The fallback mechanism still doesn't support the full type hierarchy information in tika-mimetypes.xml, but already knows about base types and the hardcoded specialization rules in MediaType.isSpecializationOf().
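
As a rough illustration of what "base types and hardcoded specialization rules" can mean, the sketch below derives a supertype chain from the type string alone, with no registry lookup. It is not a copy of revision 938966 or of MediaType.isSpecializationOf(); the class SuperTypeRulesSketch, its methods, and the exact rule set (parameters stripped to the base type, +xml types under application/xml, text types under text/plain, everything under application/octet-stream) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class SuperTypeRulesSketch {

    /** Returns the assumed supertype of a media type string, or null at the root. */
    static String superTypeOf(String type) {
        int semicolon = type.indexOf(';');
        if (semicolon >= 0) {
            // A parameterized type falls back to its base type first.
            return type.substring(0, semicolon).trim();
        }
        if (type.endsWith("+xml")) {
            return "application/xml";
        }
        if (type.startsWith("text/") && !type.equals("text/plain")) {
            return "text/plain";
        }
        if (!type.equals("application/octet-stream")) {
            return "application/octet-stream";
        }
        return null; // application/octet-stream is the root of the chain
    }

    /** Lists the type followed by each of its supertypes, most specific first. */
    static List<String> chain(String type) {
        List<String> result = new ArrayList<>();
        for (String t = type; t != null; t = superTypeOf(t)) {
            result.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(chain("application/example+xml; charset=UTF-8"));
        System.out.println(chain("text/html"));
    }
}
```

A parser lookup would try each entry of the chain in order. Supertype relations declared explicitly in tika-mimetypes.xml are not covered by string rules like these, which is the gap the comment above describes.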

          Jukka Zitting added a comment -

          I committed a more complete implementation of this in revision 955963. The solution is based on the work in TIKA-308.


            People

            • Assignee: Jukka Zitting
            • Reporter: Ken Krugler
