Tika / TIKA-447

Container aware mimetype detection

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.10
    • Component/s: mime
    • Labels: None

      Description

      As discussed on the dev list, Tika should ideally have a configurable way to process container-based formats (e.g. zip files and OLE2 files) when trying to detect the correct mime type for a document.

      This needs to be configurable, because some people won't want Tika to do all the work of parsing the whole file when they're not interested in knowing exactly what's in it.

      Once we have gone to the trouble of opening and parsing the container file, we should try to keep the open container around to speed up parsing of the contents.

      1. TIKA-447-TikaInputStream.patch
        6 kB
        Jukka Zitting
      2. TikaContainerDetection.patch
        16 kB
        Nick Burch

        Activity

        Nick Burch added a comment -

        Patch which implements limited OLE2 and ODF detection by parsing the containers. May not be the best way to do it, however...

        Nick Burch added a comment -

        As no-one has objected, I've committed this initial code in r980058.

        With this commit, OLE2 based detection should be complete, and some Zip based detection is there, but some still remains to be added.

        Chris A. Mattmann added a comment -

        Nick, awesome!

        Nick Burch added a comment -

        I've added support for OOXML files (detection + container re-use), as well as Jar files.

        I believe the only zip-based container format we can't currently detect with this is iWork. I've figured out how to tell it's an iWork document, but not how to tell which iWork document subtype it is.

        I think the only bit left for now is to document it. We don't currently have a Detection section in the documentation. Shall I create a new one, put in the basics from one of the ApacheCon Tika talks, then add a section on container-aware detection?

        Chris A. Mattmann added a comment -

        Nick, awesome job! Comments below:

        > I think the only bit left for now is to document it. We don't currently have a Detection section in the documentation. Shall I create a new one, put in the basics from one of the ApacheCon Tika talks, then add a section on container-aware detection?

        Yep, I would do this. I would just add some APT documentation and create a section called "Detection", with some useful information in it. From that APT page you could then also link to the Wiki page where the discussion on container metadata took place:

        http://wiki.apache.org/tika/MetadataDiscussion

        Cheers,
        Chris

        Jukka Zitting added a comment -

        It would be great if the AutoDetectParser could automatically leverage such detectors that use external parser libraries. The AutoDetectParser can't directly link to such parsers due to dependency issues, but we could use the service provider mechanism just like we do with Parser classes to automatically load all the Detectors available in the classpath. To do this effectively, I'd also add a Detector.getSupportedTypes() method like below so that more complex and potentially more expensive (need to read the entire document) detectors like POIFSContainerDetector could only be called if a more generic detector first determines that the input document matches the supported base type.

        /**
         * Returns the set of base media types supported by this detector
         * when used with the given parse context. The base media type can
         * be <code>application/octet-stream</code> for generic detectors
         * or a more specific type like <code>text/plain</code> or
         * <code>application/zip</code> for detectors that can only
         * distinguish between subtypes of that base type.
         *
         * @since Apache Tika 0.8
         * @param context parse context
         * @return immutable set of media types
         */
        Set<MediaType> getSupportedTypes(ParseContext context);
        Nick Burch added a comment -

        At the moment, the ContainerAwareDetector checks the first 8 bytes of the file. If they match the OLE2 header signature, it hands the file off to POIFS. If the first 4 bytes match the zip header signature, it does zip checking. If neither matches, it falls back to the default detector.

        To me, this seems simpler!
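
The signature check described above can be sketched roughly as follows. This is a minimal illustration, not the code committed in r980058: the class name `ContainerSignature` and the `Kind` enum are hypothetical. The byte values are the standard OLE2 compound-document signature and the zip local-file-header signature.

```java
/** Hypothetical helper (not a Tika class): classifies a file by its leading bytes. */
final class ContainerSignature {
    enum Kind { OLE2, ZIP, OTHER }

    // OLE2 compound document signature: 8 bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Zip local file header signature "PK\003\004": 4 bytes.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Decides which container check (if any) the header bytes call for. */
    static Kind classify(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Kind.OLE2;   // would be handed off to an OLE2 reader such as POIFS
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Kind.ZIP;    // would go on to inspect the zip entries
        }
        return Kind.OTHER;      // fall back to the default detector
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Only the first few bytes are needed for this decision, so the expensive container parsing is skipped entirely for anything that is neither OLE2 nor zip.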

        Alex Ott added a comment -

        @Nick: will this allow us to implement support for self-extracting archives? If we implement that as a separate checker, we'll need to implement archive extraction/detection inside that checker, which could lead to code duplication.

        Jukka Zitting added a comment -

        Hmm, I guess you're right, perhaps we won't need such multi-level detector functionality. The alternative is to simply load all available Detectors, run them on the input document and finally select the most specific of the returned media types.

        Alex Ott added a comment -

        It's better to have some flag that says "stop if this rule matched", because applying all of the rules could lead to poor performance.
        For zips, for example, it's better to have something like:

        • rule for jar: zip-type == X1
        • rule for odf: zip-type == X2
          .....

        zip-type would be calculated once, on first invocation, and then re-used. The subtype rules (for jar, odf, etc.) would have no "stop here" flag, while the rule for ordinary zips would have it, so we'd stop after checking all of the subtypes.
        The same could be implemented for OLE2 and other container formats, like OGG, etc.

        Nick Burch added a comment -

        Jukka - that might end up being more work though? Also, short of refactoring the current mime types to split out all the different bits, I'm not sure we will have that many new detectors ever?

        Nick Burch added a comment -

        Alex - have a look at the code, I think it already does what you're asking of it.

        For OLE2, when we detect the OLE2 signature, we load the file into POIFS. We then ask the detector what it is based on this.

        For Zip, we look at each entry in the zip file in turn. If it's one whose name we recognise, and that tells us all we need, we return. Otherwise, we open up that entry, and grab the mime type from that, and return.
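
A rough sketch of that zip walk, under stated assumptions: the class `ZipEntrySniffer` is hypothetical and is not Tika's ZipContainerDetector. It relies on two well-known conventions, namely that a Jar carries a `META-INF/MANIFEST.MF` entry and that an ODF package stores its media type as text in an entry named `mimetype`. Real detection covers more formats (OOXML, for instance) and more edge cases.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical sketch (not a Tika class): guesses a zip-based type from entry names. */
final class ZipEntrySniffer {

    /**
     * Walks the entries in order and returns as soon as one settles the question.
     * The caller's stream is consumed but deliberately not closed here.
     */
    static String detect(InputStream zipData) throws IOException {
        ZipInputStream zip = new ZipInputStream(zipData);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String name = entry.getName();
            if (name.equals("META-INF/MANIFEST.MF")) {
                // The entry name alone tells us all we need.
                return "application/java-archive";
            }
            if (name.equals("mimetype")) {
                // ODF convention: the entry body is the media type itself.
                // readAllBytes() (Java 9+) stops at the end of the current entry.
                byte[] body = zip.readAllBytes();
                return new String(body, StandardCharsets.US_ASCII).trim();
            }
        }
        return "application/zip"; // nothing more specific was recognised
    }
}
```

Returning on the first decisive entry keeps the cost low for well-formed packages, where the telling entry tends to sit at or near the start of the archive.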

        Alex Ott added a comment -

        Ah, sorry Nick - I hadn't looked at the code yet. I thought that we would get stuck in the container check once the file matched some signature.

        Jukka Zitting added a comment -

        It's a bit more work, yes. What I'm trying to achieve here is for someone who just uses "new Tika().detect(...)" to be able to benefit from these extra detectors when they're available in the classpath.

        Nick Burch added a comment -

        Using the container aware detector will give a more accurate answer generally, but at the cost of more memory use, and longer processing time. (Oh, and plus the need for various parser dependencies)

        There was some reluctance on-list about making this the default, due to the memory and processing impact of opening the container, which we'll need to take notice of.

        There's also the issue of making sure the detectors run in the right order, which may matter for some but not for others. Alas I don't have a good answer for the way to handle all these different needs...

        Jukka Zitting added a comment -

        BTW, the current new Detector implementations are a bit troublesome as they break the contract that the detect() method must not close() the given stream and should use mark() and reset() where necessary to avoid changing the state of the stream. The rationale behind this contract is that you should be able to call parse() on the same stream instance after detecting its type.

        The attached patch fixes this issue by using the TikaInputStream.getFile() method to access the underlying file (when available or spooled) when detecting these kinds of complex container formats. If the given stream is not a TikaInputStream, then just the generic application/zip or application/x-tika-msoffice type is returned.
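
The stream contract described above (never close, and use mark() and reset() so the stream position is unchanged) can be illustrated with plain java.io. This is only a sketch: `StreamPeek` is a hypothetical helper, not part of the attached patch, and it does not show the TikaInputStream.getFile() approach.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not a Tika class): reads leading bytes without disturbing the stream. */
final class StreamPeek {

    /**
     * Returns up to n leading bytes. The stream is rewound afterwards and is
     * never closed, so a later parse() could start from the same position.
     */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException(
                "stream must support mark/reset, e.g. wrap it in a BufferedInputStream");
        }
        in.mark(n);
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break; // stream is shorter than n bytes
                }
                total += read;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // restore the position even if reading failed
        }
    }
}
```

Peeking this way is enough for header signatures. Formats that need random access to the whole container are the reason the patch turns to an underlying file instead.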

        Jukka Zitting added a comment -

        I committed my patch in revision 982175.

        > memory and processing impact of opening the container

        I think this is acceptable, as the extra cost is only associated with specific media types, and we can use the open container feature you added to TikaInputStream to allow later parsing stages to avoid duplicating these costs. Also, since this functionality is now only triggered when the detector is passed a TikaInputStream, a performance-conscious user can easily prevent the extra processing. We might also want to add some extra flag for this if needed.

        > detectors run in the right order

        This was a part of my thinking behind the proposed getSupportedTypes() method. With that we could choose to only run these kinds of more complex detectors when simpler detectors have first identified the basic container format.
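
One way to picture that ordering idea: a cheap first pass yields a base type, and only detectors that declare that base type get to run. The sketch below is an assumption-laden illustration. `GatedDetector` and `TwoStageDetection` are hypothetical names, media types are plain strings rather than Tika's MediaType, and the proposed getSupportedTypes(ParseContext) signature is simplified to a no-argument method.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a detector that declares which base types it can refine. */
interface GatedDetector {
    /** Base media types this detector knows how to refine. */
    Set<String> supportedBaseTypes();

    /** Potentially expensive check; returns a more specific type, or null if unsure. */
    String refine(byte[] data);
}

/** Hypothetical sketch: runs expensive detectors only when the base type matches. */
final class TwoStageDetection {

    static String detect(String baseType, byte[] data, List<GatedDetector> detectors) {
        for (GatedDetector detector : detectors) {
            // Skip detectors that cannot distinguish subtypes of this base type.
            if (!detector.supportedBaseTypes().contains(baseType)) {
                continue;
            }
            String refined = detector.refine(data);
            if (refined != null) {
                return refined;
            }
        }
        return baseType; // no specialised detector applied or succeeded
    }
}
```

With this shape, a detector that needs to open a whole zip container is never invoked for, say, a plain text file.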

        Nick Burch added a comment -

        I've added some Detector documentation in r985242, please everyone dive in with bits I have missed!

        Jukka Zitting added a comment -

        I refactored the code a bit in revision 1042476 to make it easier to compose with other kinds of detectors. Most notably I removed the ContainerDetector interface and made the POIFSContainerDetector and ZipContainerDetector classes directly implement the Detector interface.

        Jukka Zitting added a comment -

        In revision 1042497 I added an auto-loading mechanism for detectors so that tools like the Tika facade or the AutoDetectParser class can automatically pick up all detector implementations in the current classpath. This way the container-aware detectors can also be used with minimal changes to client code.

        To prevent excessive performance overhead, both the Zip and POIFS detectors will first check for the relevant magic byte header and will only do the more expensive format check if the byte header matches and if the given stream is a TikaInputStream instance.

        In revision 1042498 I added a new --detect option to the CLI for easier testing of the auto-detect functionality. Also, since the container-aware detectors are now automatically loaded and used, there's no longer any need for the explicit --container-aware-detector option and I've turned it into a no-op.
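
For readers unfamiliar with classpath-based auto-loading, the general mechanism can be sketched with the JDK's java.util.ServiceLoader. This is not the code from revision 1042497: `SimpleDetector` and `DetectorLoader` are hypothetical, and Tika's actual Detector interface and loading code differ. Implementations would be registered in a `META-INF/services/` file named after the interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical detector SPI; Tika's real Detector interface has a different signature. */
interface SimpleDetector {
    /** Returns a media type, or null if the header is not recognised. */
    String detect(byte[] header);
}

/** Hypothetical sketch: discovers detectors on the classpath and chains them. */
final class DetectorLoader {
    static final String FALLBACK = "application/octet-stream";

    /** Collects every implementation registered under META-INF/services. */
    static List<SimpleDetector> loadAll() {
        List<SimpleDetector> found = new ArrayList<>();
        for (SimpleDetector detector : ServiceLoader.load(SimpleDetector.class)) {
            found.add(detector);
        }
        return found;
    }

    /** Asks each detector in turn and returns the first non-null answer. */
    static String detect(List<SimpleDetector> detectors, byte[] header) {
        for (SimpleDetector detector : detectors) {
            String type = detector.detect(header);
            if (type != null) {
                return type;
            }
        }
        return FALLBACK;
    }
}
```

Client code would call `DetectorLoader.detect(DetectorLoader.loadAll(), header)`. Adding a jar that registers another implementation changes the result without any change to the caller, which is the property the comment above is after.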

        Jukka Zitting added a comment -

        I think we are pretty much done with this issue already.

        Before closing this, I'd like to move the new classes from within o.a.t.detect to appropriate o.a.t.parser subpackages in tika-parsers. That way the detection logic is closer to the related parser classes and we don't have to worry about split-package warnings from OSGi.

        Jukka Zitting added a comment -

        As suggested above, I moved the detector classes from o.a.t.detect to o.a.t.parser subpackages in revision 1159985.

        That should complete the last remaining open issue with this feature, so resolving as fixed.


          People

          • Assignee: Unassigned
          • Reporter: Nick Burch