TIKA-786

Tika CLI --detect returns incorrect content-type for files with altered extensions

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.1
    • Component/s: cli
    • Labels:
      None
    • Environment:

      Windows

      Description

      From a discussion on the user mailing list on Nov. 11, 2011, where the following was requested as a new bug: Tika CLI will return incorrect content type information when called with --detect for files that have had their extensions modified (and nothing else). MS Word (.doc) documents that have their extension changed to .xls or .ppt will be incorrectly detected as Excel or PowerPoint documents, whereas the --metadata option will determine the content type correctly (as application/msword), based on the actual contents of these mis-named files. The same also occurs with other types of MS Office 2003 documents, and could possibly occur with a wide range of document types. To quote Nick B., from the user mailing list: "If you look at the TestMediaTypes class you'll see what you can get with just the mime magic and filenames, and then there's TestContainerAwareDetector which shows the correct detection happening by using the extra detectors available".

        Activity

        Nick Burch added a comment -

        In r1204435, I've added some failing+disabled unit tests for this. If you re-enable the tests on lines 81-83 and 127-129, you'll see this issue

        Nick Burch added a comment -

        The problem seems to be with how DefaultDetector handles conflicting detection, which is different to how the previous ContainerAwareDetector did so

        Previously, the logic was to ask the container detectors to review the file. If they had a good match, that was used as the mimetype. Only if the container ones didn't know would the mime magic+filename detection (provided by MimeTypes) be used

        Under the new DefaultDetector system, this has changed. Instead, each detector is tried in turn, and while detectors are allowed to specialise a file they are not permitted to change it completely (if a previous one was wrong)

        It looks like this DefaultDetector logic will need to be changed, to allow detectors such as the container ones to override incorrect (typically filename based) detection
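
        For illustration only, the "specialise but never replace" rule described above can be modelled in a few lines of self-contained Java. This is a sketch, not the Tika source: the class name SpecialiseOnlySketch, the hard-coded type hierarchy, and the two stand-in detection results are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

// Illustrative model only; names and hierarchy are assumptions, not Tika code.
class SpecialiseOnlySketch {
    static final String OCTET = "application/octet-stream";

    // child type -> parent type (simplified, hard-coded hierarchy)
    static final Map<String, String> PARENT = Map.of(
            "application/x-tika-msoffice", OCTET,
            "application/msword", "application/x-tika-msoffice",
            "application/vnd.ms-excel", "application/x-tika-msoffice");

    // True if candidate equals current or is a descendant of it.
    static boolean isSpecialisationOf(String candidate, String current) {
        for (String t = candidate; t != null; t = PARENT.get(t)) {
            if (t.equals(current)) {
                return true;
            }
        }
        return false;
    }

    // Takes each detection result in turn; a later result may only
    // refine the running answer, never replace it with a sibling type.
    static String combine(List<String> resultsInOrder) {
        String type = OCTET;
        for (String detected : resultsInOrder) {
            if (isSpecialisationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // A Word document renamed to .xls: the filename-based result comes first.
        String filenameBased = "application/vnd.ms-excel";
        String containerBased = "application/msword";
        System.out.println(combine(List.of(filenameBased, containerBased)));
    }
}
```

        In this model the container result (application/msword) is discarded because it is not a specialisation of the earlier filename-based result, which matches the symptom reported in this issue.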

        Jukka Zitting added a comment -

        Hmm, I didn't think of such a case when doing the DefaultDetector logic. My idea was that more accurate container detectors would just refine a more generic detection result from the basic detectors that are always run first. In this case though the basic detector ends up giving wrong results, which breaks my logic.

        Since the container detectors give practically always correct results, I guess it's fine to always use their results. Or perhaps even better, we could check the detectors in reverse order so that the most accurate detection result is used as the starting point and less accurate detection based on things like the file name could only refine the detection result to a more specific media type.

        Nick Burch added a comment -

        Do we have any control over the ordering though? My hunch is that user supplied ones should probably be used in preference to Tika ones, and the parser based detectors in Tika should be used in preference to the Mime Type ones

        One situation where the mimetype detection is better is with truncated files. Here the container detector can just say "looks like one of mine, can't tell you any more" while the mimetype one can use the filename to fill in the rest. I've a feeling that at least some people pass in only the first few kb of files for detection, to ensure it's fast, so their use case would want the MimeTypes detector logic based on filename to kick in to specialise.

        Jukka Zitting added a comment -

        "Do we have any control over the ordering though?"

        Some. The type database always comes first, which for most use cases should be good enough.

        "One situation where the mimetype detection is better is with truncated files."

        Right. The good thing about the container detectors is that they only give a result (other than application/octet-stream) if they're really sure about the detection result. So with the proposed reverse detection order the type database would always be consulted last and be able to provide a fallback result in case none of the more accurate detectors worked.
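
        The proposed reverse order, including the truncated-file fallback, can be sketched roughly as follows. Everything here is a simplified assumption for illustration: the Detector interface with a "complete" flag, the stand-in CONTAINER and TYPE_DATABASE detectors (the container one pretends the content is always a Word document), and the hard-coded hierarchy are not Tika's real API.

```java
import java.util.List;
import java.util.Map;

// Illustrative model only: the interface, type hierarchy and detectors are assumptions.
class ReverseOrderSketch {
    static final String OCTET = "application/octet-stream";
    static final String OLE2 = "application/x-tika-msoffice";
    static final String WORD = "application/msword";
    static final String EXCEL = "application/vnd.ms-excel";

    // child type -> parent type (simplified, hard-coded hierarchy)
    static final Map<String, String> PARENT = Map.of(OLE2, OCTET, WORD, OLE2, EXCEL, OLE2);

    interface Detector {
        // 'complete' is false when only the first few kB of the file were supplied
        String detect(boolean complete, String fileName);
    }

    // Stand-in container detector: certain for a complete file, generic for a truncated one.
    static final Detector CONTAINER = (complete, fileName) -> complete ? WORD : OLE2;

    // Stand-in type database: trusts the file extension.
    static final Detector TYPE_DATABASE = (complete, fileName) ->
            fileName.endsWith(".xls") ? EXCEL : fileName.endsWith(".doc") ? WORD : OCTET;

    // True if candidate equals current or is a descendant of it.
    static boolean isSpecialisationOf(String candidate, String current) {
        for (String t = candidate; t != null; t = PARENT.get(t)) {
            if (t.equals(current)) {
                return true;
            }
        }
        return false;
    }

    // Runs the detectors in the given order; later results may only refine earlier ones.
    static String detect(List<Detector> detectors, boolean complete, String fileName) {
        String type = OCTET;
        for (Detector d : detectors) {
            String detected = d.detect(complete, fileName);
            if (isSpecialisationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // Most accurate detector first, type database last.
        List<Detector> reversed = List.of(CONTAINER, TYPE_DATABASE);
        System.out.println(detect(reversed, true, "report.xls"));   // complete file, wrong extension
        System.out.println(detect(reversed, false, "report.doc"));  // truncated file, correct extension
    }
}
```

        In this model a complete, mis-named file keeps the container detector's answer, while for a truncated file the container detector only reports the generic OLE2 type and the filename-based result is allowed to refine it.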

        Nick Burch added a comment -

        I've had a go at solving this in r1204476, by having DefaultDetector order them differently, based on the discussions here. (The reversing is done here, rather than in CompositeDetector, as that seems to make more sense to me)

        This has allowed me to enable the previously failing tests for this issue, and all other tests still pass
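
        As a rough sketch of the kind of ordering discussed in this thread (non-Tika detectors first, Tika's own detectors next, the type database last), the following self-contained Java sorts detector class names. It is not the code from r1204476: the class DetectorOrderSketch, the use of a package-name prefix to tell Tika detectors from user-supplied ones, and the sample class names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative model only; works on class names rather than real detector objects.
class DetectorOrderSketch {
    // Assumption: detectors shipped with Tika live under this package prefix.
    static final String TIKA_PREFIX = "org.apache.tika.";

    // Returns the loaded detectors with non-Tika ones first, then Tika ones,
    // and finally the type database appended as the fallback.
    static List<String> order(List<String> loadedDetectors, String typeDatabase) {
        List<String> ordered = new ArrayList<>(loadedDetectors);
        // false sorts before true, so non-Tika names come first; ties sort by name
        ordered.sort(Comparator
                .comparing((String name) -> name.startsWith(TIKA_PREFIX))
                .thenComparing(Comparator.<String>naturalOrder()));
        ordered.add(typeDatabase);
        return ordered;
    }

    public static void main(String[] args) {
        List<String> loaded = List.of(
                "org.apache.tika.parser.pkg.ZipContainerDetector",
                "com.example.CustomDetector", // hypothetical user-supplied detector
                "org.apache.tika.parser.microsoft.POIFSContainerDetector");
        order(loaded, "org.apache.tika.mime.MimeTypes").forEach(System.out::println);
    }
}
```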

        Jukka Zitting added a comment -

        Cool, looks good. I was simultaneously approaching this from a slightly different angle (see https://github.com/jukka/tika/commit/97a15bdcd79549d3c5147b7b8f9b6f46a9bb8fc5), but your changes look nicer (I like the way you can give preference to non-Tika detectors) so let's go with that.

        Nick Burch added a comment -

        Explanation added to CHANGES in r1204479, so I think this is now resolved


          People

          • Assignee: Unassigned
          • Reporter: John Mastarone