Tika / TIKA-806

MS Word Detection magics are a bit overzealous


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.1
    • Fix Version/s: 1.1
    • Component/s: mime
    • Labels: None

    Description

      tika-mimetypes.xml contains the following magic for MS Word:

      <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
      <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" type="string" offset="1152:4096" />
      </match>
      

      So if a file is an MS Office document (it matches the parent Office magic) and contains the WordDocument string within the given offset range, it is detected as Word. I have a few (regrettably confidential) counterexamples: MS Excel files with embedded Word documents. For instance, one has "Workbook" (with 0x00 between characters) at offset 0x0480 and "WordDocument" (also with 0x00 between characters) at offset 0x0B80. This is an Excel file, yet it meets the magic criterion above. Returning x-tika-msoffice instead would dispatch the file to the POI detector, which would return the correct answer.
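
The false positive can be reproduced without the confidential files. The following is a minimal Python sketch, not Tika's actual matching code: `matches_word_magic` and `synthetic_excel_with_embedded_word` are hypothetical helpers, and the sketch assumes that `offset="1152:4096"` means the nested match may start at any offset from 1152 through 4096. The synthetic buffer only mimics the marker positions reported above; it is not a valid OLE2 file.

```python
# Parent magic: the 8-byte OLE2 compound document signature at offset 0.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

# "WordDocument" with 0x00 between characters. The magic value ends at "t",
# so the trailing 0x00 of the UTF-16LE encoding is dropped.
WORD_MARKER = "WordDocument".encode("utf-16-le")[:-1]
WORKBOOK_MARKER = "Workbook".encode("utf-16-le")


def matches_word_magic(data: bytes) -> bool:
    """Emulate the nested magic: OLE2 header plus WordDocument marker."""
    if not data.startswith(OLE2_MAGIC):
        return False
    # Assumption: the match may START anywhere in 1152..4096 inclusive.
    start, end = 1152, 4096
    return data.find(WORD_MARKER, start, end + len(WORD_MARKER)) != -1


def synthetic_excel_with_embedded_word() -> bytes:
    """Build a buffer with the marker layout described in the report."""
    buf = bytearray(8192)
    buf[0:len(OLE2_MAGIC)] = OLE2_MAGIC
    buf[0x0480:0x0480 + len(WORKBOOK_MARKER)] = WORKBOOK_MARKER
    buf[0x0B80:0x0B80 + len(WORD_MARKER)] = WORD_MARKER
    return bytes(buf)


if __name__ == "__main__":
    sample = synthetic_excel_with_embedded_word()
    # The workbook-first layout still satisfies the Word magic.
    print(matches_word_magic(sample))
```

Under these assumptions the check succeeds for the Excel-like layout, because the rule never looks at which directory entry comes first or whether a "Workbook" entry is also present.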

      I vote for removing that magic. I looked at some of my files, and the "WordDocument" and "Workbook" strings occur at various offsets. The presence of embedded documents makes detection by those strings unreliable.
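
To check how widely the offsets vary in a given collection, a small scan along these lines may help. This is a sketch only: `marker_offsets` is a hypothetical helper, and it assumes the markers appear as UTF-16LE text, as described above.

```python
from typing import List


def marker_offsets(data: bytes, name: str) -> List[int]:
    """Return every offset at which `name` occurs as UTF-16LE text."""
    needle = name.encode("utf-16-le")
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets


def report(path: str) -> None:
    """Print the offsets of both markers for one file."""
    with open(path, "rb") as handle:
        data = handle.read()
    for name in ("Workbook", "WordDocument"):
        hex_offsets = [hex(offset) for offset in marker_offsets(data, name)]
        print(path, name, hex_offsets)
```

Calling `report` on each file in a sample set lists where both strings appear, which shows whether any fixed offset range can separate Word files from Excel files that embed Word documents.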

          People

            antheque Antoni Mylka
            antoni.mylka Antoni Mylka