Tika / TIKA-533

Mis-detection of zip files as application/vnd.apple.iwork

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Windows 7 64-bit, latest Tika build as of 18th Oct 2010.

      Description

      It appears that, at least in some circumstances, a zip file containing only another zip file is being mis-detected as application/vnd.apple.iwork.
      In addition, for such files, the command-line parser does not return any output at all.

      1. zip-within-zip.zip
        0.3 kB
        Geoff Jarrad

        Activity

        Jukka Zitting added a comment -

        This is already resolved as discussed above.

        Chris A. Mattmann added a comment -
        Pushing out to 0.9: there's no patch for this yet, and it's 0.8 release time.
        Nick Burch added a comment -

        I can't see any more TODOs in the container detector code, so I guess it's good to go!

        We should probably provide a commandline option to let people switch back, and we'll want to do the gui too

        Jukka Zitting added a comment -

        Updated issue summary to better highlight that this is a problem with the application/vnd.apple.iwork type.

        > It's a little evil, but it's fast and low memory, and maybe someone can improve it later on!

        Those are the best kinds of changes. Incremental improvements FTW!

        BTW, I think we might as well make container-aware detection enabled by default in the CLI.

        Nick Burch added a comment -

        In r1024291 I've also added the --container-aware-detector flag to the Tika CLI

        Nick Burch added a comment -

        I've added iWork support to the container aware detector in r1024255. It's a little evil, but it's fast and low memory, and maybe someone can improve it later on!

        Jukka Zitting added a comment -

        The magic byte pattern we added in TIKA-402 for detecting iWork documents seems to be too eager, as it also matches this document. Note that the test file itself being part of a zip archive makes no difference; it is detected as application/vnd.apple.iwork even as a standalone document.

        I removed the application/vnd.apple.iwork magic byte pattern in revision 1023712, which should solve your problem. It looks like we should instead use a container-aware detector also for iWork, as I had to also disable a few iWork test cases that would no longer correctly detect the format.
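
        To illustrate why a leading magic byte pattern cannot separate these cases, here is a minimal sketch that uses only java.util.zip and no Tika classes. It builds a zip-within-zip and a stand-in "iWork-like" archive in memory, shows that both start with the same ZIP signature, and then tells them apart by entry name. The class name ZipContainerSketch, its helper methods, and the marker entry names (index.apxl, index.xml) are assumptions made for this example; it is not the detector code committed for this issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipContainerSketch {

    // Local file header signature that begins every non-empty ZIP archive.
    static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /** Builds a ZIP archive in memory that holds a single entry. */
    static byte[] zipOf(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content);
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Magic-byte style check: looks only at the first four bytes. */
    static boolean startsWithZipMagic(byte[] data) {
        if (data.length < ZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (data[i] != ZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Container-aware style check: walks the archive and collects entry names. */
    static List<String> entryNames(byte[] data) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    /** Assumed marker entries for this sketch; real iWork detection may differ. */
    static boolean looksLikeIWork(byte[] data) throws IOException {
        List<String> names = entryNames(data);
        return names.contains("index.apxl") || names.contains("index.xml");
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_8);
        byte[] outer = zipOf("inner.zip", zipOf("page.html", html));
        byte[] iworkLike = zipOf("index.apxl", "<presentation/>".getBytes(StandardCharsets.UTF_8));

        System.out.println("outer magic:     " + startsWithZipMagic(outer));
        System.out.println("iworkLike magic: " + startsWithZipMagic(iworkLike));
        System.out.println("outer iWork?     " + looksLikeIWork(outer));
        System.out.println("iworkLike iWork? " + looksLikeIWork(iworkLike));
    }
}
```

        The trade-off is that the entry-name check has to read into the archive, whereas the magic check needs only a few leading bytes.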

        Geoff Jarrad added a comment -

        No, using the ContainerAwareDetector doesn't seem to change the detected content-type. My quick code test looks like this:

        String url = "file:/D:/debug-experiments/debug-corpus/zip-within-zip.zip";
        URL source = new URL(url);
        Metadata metadata = new Metadata();
        InputStream stream = source.openStream();
        AutoDetectParser a = new AutoDetectParser();
        ContainerAwareDetector c = new ContainerAwareDetector(a.getDetector());
        MediaType mt = c.detect(stream, metadata);
        System.out.printf("Detected media-type=%s\n", mt);

        The output for me is:

        Detected media-type=application/vnd.apple.iwork

        I acknowledge that I might be misusing the ContainerAwareDetector. In my usual code (due to the fact that Tika does not detect UTF-16 encoded XML as XML), I have been extracting the byte header myself and calling AutoDetectParser.getDetector().getMimeType(header) directly. This doesn't currently seem to be possible with the ContainerAwareDetector, hence I haven't been using it.
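
        The header extraction described above can be sketched with plain java.io, independent of any Tika API. The example peeks at the first bytes of a stream with mark/reset so the stream can still be handed to a detector or parser afterwards, and checks for a UTF-16 byte order mark. The class name HeaderPeekSketch and its methods are hypothetical names for this illustration, and a BOM check is only one possible heuristic for spotting UTF-16 content.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HeaderPeekSketch {

    /** Reads up to max bytes from the stream, then rewinds it to where it started. */
    static byte[] peek(InputStream in, int max) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(max);
        byte[] buf = new byte[max];
        int total = 0;
        while (total < max) {
            int n = in.read(buf, total, max - total);
            if (n < 0) {
                break; // end of stream before max bytes were available
            }
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    /** True if the header starts with a UTF-16 byte order mark (FE FF or FF FE). */
    static boolean hasUtf16Bom(byte[] header) {
        if (header.length < 2) {
            return false;
        }
        int b0 = header[0] & 0xFF;
        int b1 = header[1] & 0xFF;
        return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
    }

    public static void main(String[] args) throws IOException {
        // Java's "UTF-16" charset encodes with a leading big-endian BOM.
        byte[] xml = "<?xml version=\"1.0\"?><root/>".getBytes(StandardCharsets.UTF_16);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(xml));

        byte[] header = peek(in, 8);
        System.out.println("UTF-16 BOM present: " + hasUtf16Bom(header));
        // The stream is still positioned at its first byte here.
        System.out.println("first byte after peek: " + in.read());
    }
}
```

        Wrapping the source in a BufferedInputStream is what makes mark/reset available when the underlying stream does not support it.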

        Nick Burch added a comment -

        We should probably add an option to the CLI to use the container aware detector - that'd fix this case, and if we're parsing the file anyway, the extra detection work will be offset by the reduction in parser work.

        Geoff - if you use the container aware detector in your code, rather than the plain mime one, does that fix the problem for you?

        Geoff Jarrad added a comment -

        Simple zip file containing a zip of an HTML file, demonstrating the specified bug.


          People

          • Assignee: Unassigned
          • Reporter: Geoff Jarrad