Tika / TIKA-806

MS Word Detection magics are a bit overzealous

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.1
    • Fix Version/s: 1.1
    • Component/s: mime
    • Labels:
      None

      Description

      tika-mimetypes.xml contains the following magic for MS Word:

      <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
      <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" type="string" offset="1152:4096" />
      </match>
      

      So if a file is an MS Office document (parent Office magic) and has the WordDocument string within the given offsets, then it's Word. I have a few (regrettably confidential) counterexamples of MS Excel files with embedded Word documents. For instance, one has "Workbook" (with 0x00 between characters) at offset 0x0480 and "WordDocument" (0x00's between characters) at offset 0x0B80. This is an Excel file, yet it meets the above-mentioned magic criterion. Returning x-tika-msoffice would dispatch the file to the POI detector, which would return the correct answer.

      I vote for removing that magic. I took a look at some of my files and it seems that "WordDocument" and "Workbook" strings do occur at various offsets. The presence of embedded documents makes detection by those strings unreliable.
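The false positive described above can be reproduced with a small, self-contained sketch. It is a hypothetical re-implementation of the quoted rule, not Tika's matcher code: the class and method names are invented for illustration, the offset range 1152:4096 is assumed to mean inclusive start offsets, and the synthetic buffer only imitates the layout reported for the confidential Excel file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical re-implementation of the quoted magic rule; not Tika's own code. */
final class WordMagicSketch {
    // OLE2 signature, as in the parent <match> (offset 0:8).
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if needle starts at some offset in [from, to] (assumed inclusive). */
    static boolean containsAt(byte[] data, byte[] needle, int from, int to) {
        for (int start = from; start <= to && start + needle.length <= data.length; start++) {
            byte[] window = Arrays.copyOfRange(data, start, start + needle.length);
            if (Arrays.equals(window, needle)) {
                return true;
            }
        }
        return false;
    }

    /** OLE2 header at offset 0, then "WordDocument" in UTF-16LE at 1152..4096. */
    static boolean matchesWordMagic(byte[] data) {
        // Note: UTF-16LE encoding includes the trailing 0x00 after 't',
        // which the quoted pattern omits.
        byte[] name = "WordDocument".getBytes(StandardCharsets.UTF_16LE);
        return containsAt(data, OLE2, 0, 0) && containsAt(data, name, 1152, 4096);
    }

    /** Fake "Excel with embedded Word" prefix using the offsets from the report. */
    static byte[] syntheticExcelWithEmbeddedWord() {
        byte[] data = new byte[8192];
        System.arraycopy(OLE2, 0, data, 0, OLE2.length);
        byte[] workbook = "Workbook".getBytes(StandardCharsets.UTF_16LE);
        byte[] word = "WordDocument".getBytes(StandardCharsets.UTF_16LE);
        System.arraycopy(workbook, 0, data, 0x0480, workbook.length);
        System.arraycopy(word, 0, data, 0x0B80, word.length);
        return data;
    }
}
```

Calling `WordMagicSketch.matchesWordMagic(WordMagicSketch.syntheticExcelWithEmbeddedWord())` returns true even though the buffer also carries a "Workbook" entry, which is the misdetection reported here.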

      1. tika-806-ver2.patch
        5 kB
        Antoni Mylka
      2. tika-806-ver3.zip
        4 kB
        Antoni Mylka

        Activity

        Antoni Mylka added a comment -

        A patch which removes those magics from tika-mimetypes.xml.

        Alex Ott added a comment -

        The only reliable method to determine the .doc/.xls/.ppt/... type is to perform full parsing and look at which objects are listed under the root directory.

        Antoni Mylka added a comment -

        A second version of the patch which doesn't break the build. The unit tests are updated.

        Nick Burch added a comment -

        The file format allows for the directory entries to occur at any point within the file, so you're correct that the only fully reliable way to detect the format is to open up the OLE2 container and see what the contents are.

        However, the directory listing is often stored in the first couple of blocks, so it can allow for certain files to be detected without needing to open up the whole file and process it.

        We now prefer the container detectors over the mimetype ones by default, when using DefaultDetector, so this shouldn't be an issue on trunk. Is it?

        Antoni Mylka added a comment -

        Probably not. It's just that I don't use the DefaultDetector. In my app I first make a magic-based detection attempt based on the first 64KB of the file. Then, if it's x-tika-msoffice, I dispatch the file to another component which has access to the full file. That "other component" can then use the POIFSContainerDetector.

        It's a legacy constraint which is difficult to change now without major rearchitecting. It's not a problem for the vast majority of files, but as I said I have some where documents contain embedded documents. This is where it breaks.

        Another case where it breaks is an MS Works 7.0 Spreadsheet file. With MimeTypes it is identified as ms-excel (due to the "Workbook" string inside). With the container detector it's correctly identified as MS Works. IMHO a wrong result for some files is worse than a "more generic" result for other files, as the generic results can be refined afterwards with the container-aware detector.

        Antoni Mylka added a comment -

        It turns out that the XLR files are not detected by POIFSContainerDetector. With the third version of the patch they are. This should probably be reported as a separate issue, but it's difficult to separate them.

        Both boil down to the same thing. MimeTypes should not "guess" the concrete type of an msoffice document because there are two cases where it will return a wrong answer.

        1. A document with another document embedded within. The choice will depend on the ordering of matchers as in TIKA-391.
        2. A Works 7.0 Spreadsheet document will be detected as Excel, while it should be passed to the container detector.
        Nick Burch added a comment -

        If you use DefaultDetector it isn't an issue, as the container ones get run first. For your case, can't you just say "if the type is x-tika-msoffice or the type's parent is x-tika-msoffice use the container detector"?

        I agree that we need container-aware detectors for true OLE2 detection (that's why I wrote the original POIFS detector!), but I'm not sure about removing mime magic that is commonly correct. For many people, having that in will give a better answer than not.
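The "type or parent type" check suggested in this comment might look roughly like the sketch below. The class name, method name, and parent table are hypothetical stand-ins for whatever type hierarchy lookup the application has; the table is assumed for illustration and is not read from tika-mimetypes.xml.

```java
import java.util.Map;

/** Hypothetical dispatch rule: send OLE2-family types to the container detector. */
final class ContainerDispatchSketch {
    static final String OLE2_GENERIC = "application/x-tika-msoffice";

    // Assumed child -> parent table, for illustration only.
    static final Map<String, String> PARENT = Map.of(
        "application/msword", OLE2_GENERIC,
        "application/vnd.ms-excel", OLE2_GENERIC,
        "application/vnd.ms-powerpoint", OLE2_GENERIC);

    /** True if the magic-based result is the generic OLE2 type or a direct child of it. */
    static boolean needsContainerDetector(String magicType) {
        if (magicType == null) {
            return false; // Map.of(...) maps reject null keys in get()
        }
        return OLE2_GENERIC.equals(magicType)
            || OLE2_GENERIC.equals(PARENT.get(magicType));
    }
}
```

With this rule, a magic result of application/msword is treated as a hint only, and the file is still passed on for container-based detection.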

        Antoni Mylka added a comment -

        If you put it like this, then it becomes a matter of taste. I just thought that giving an answer that is "commonly correct" is not enough. Indeed, I can add such a hack and probably will do so. Maybe it's just that my requirements on correctness aren't that common.

        If it's only Nick vs. me, let's close this issue and keep the status quo. Any other opinions?

        Nick Burch added a comment -

        You can always get a false positive with mime magic though... We can never be completely certain, so I tend to think the line should be drawn at "generally helpful and rarely harmful". (whether this comes under that may be a different matter!)

        For the OLE2 and Zip cases, we do provide more accurate detectors, which will only run for files with the right initial mime magic, so people who care about greater accuracy (at the expense of a little more processing time) can make use of that if they choose.

        For your specific case, you only need to check the first 4 bytes to know if a file has the Zip or OLE2 mime magic. It may be best to have code that tries the first few bytes from your truncated stream, if it matches then it can pass the whole file to the appropriate container detector, and if not it can pass the first few kb to the regular mimetypes code. That's likely to be less brittle, as well as easier to follow. It should also cope well for adding other container detectors (eg Ogg) later.

        (Most people can simply pass in the whole stream to DefaultDetector and have something like this done for them. It's only special for you because you want to detect most files off of the initial few kb, with the whole file for certain types.)
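The first-bytes routing described above can be sketched as follows. All names are hypothetical, and the sketch only decides where a file should go; it does not call any Tika detector. The Zip prefix used is the common local-file-header signature, so it is assumed that empty archives (which start differently) do not matter here.

```java
/** Hypothetical router: pick a detection path from the first 4 bytes of a file. */
final class PrefixRouterSketch {
    enum Route { OLE2_CONTAINER, ZIP_CONTAINER, MIME_MAGIC }

    // First 4 bytes of the OLE2 signature quoted in the issue description.
    static final byte[] OLE2_PREFIX = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};
    // Zip local file header signature "PK\003\004".
    static final byte[] ZIP_PREFIX = {'P', 'K', 0x03, 0x04};

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Container files need the whole file; everything else can use the truncated head. */
    static Route route(byte[] head) {
        if (startsWith(head, OLE2_PREFIX)) {
            return Route.OLE2_CONTAINER;
        }
        if (startsWith(head, ZIP_PREFIX)) {
            return Route.ZIP_CONTAINER;
        }
        return Route.MIME_MAGIC;
    }
}
```

A caller would pass the whole file to the matching container detector for the first two routes and only the initial few kb to the regular mimetypes code for the third. Adding another container format later means adding one prefix and one enum value.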

        Antoni Mylka added a comment -

        You're right. No further comments. I guess I can just make use of my newly-found JIRA authority and close this issue as "Not a Problem". Then I'll add the hack to the app. If in doubt - reopen.


          People

          • Assignee: Antoni Mylka
          • Reporter: Antoni Mylka