[TIKA-806] MS Word Detection magics are a bit overzealous - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.1
Fix Version/s: 1.1
Component/s: mime
Labels: None

Description

tika-mimetypes.xml contains the following magic for MS Word:

<match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
<match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" type="string" offset="1152:4096" />
</match>

So if a file is an MS Office document (it matches the parent Office magic) and contains the WordDocument string within the given offsets, it is detected as Word. I have a few (regrettably confidential) counterexamples: MS Excel files with embedded Word documents. For instance, one has "Workbook" (with 0x00 between characters) at offset 0x0480 and "WordDocument" (again with 0x00 between characters) at offset 0x0B80. This is an Excel file, yet it meets the magic criterion above. Returning x-tika-msoffice instead would dispatch the file to the POI detector, which would return the correct answer.
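The false positive can be reproduced without the confidential files. The following is a minimal sketch in Python, not Tika's actual matching code: it assumes the offset range "1152:4096" means the pattern may start at any offset from 1152 to 4096 inclusive, and the function names and the synthetic buffer are hypothetical, built only from the offsets quoted above.

```python
# Parent magic from tika-mimetypes.xml: the OLE2 container signature.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def utf16le(name: str) -> bytes:
    """Encode a stream name with a 0x00 after every character."""
    return name.encode("utf-16-le")


def matches_word_magic(data: bytes) -> bool:
    """Approximate the Word magic: OLE2 header plus 'WordDocument' in range.

    Assumption: the pattern may start at any offset in 1152..4096 inclusive.
    """
    if not data.startswith(OLE2_MAGIC):
        return False
    # The XML pattern ends at 't' with no trailing 0x00, so drop the last byte.
    needle = utf16le("WordDocument")[:-1]
    window = data[1152:4096 + len(needle)]
    return needle in window


def synthetic_excel_with_embedded_word() -> bytes:
    """Hypothetical buffer mimicking the reported file layout (not a real XLS)."""
    buf = bytearray(8192)
    buf[0:8] = OLE2_MAGIC
    workbook = utf16le("Workbook")
    buf[0x0480:0x0480 + len(workbook)] = workbook
    word = utf16le("WordDocument")
    buf[0x0B80:0x0B80 + len(word)] = word
    return bytes(buf)


if __name__ == "__main__":
    sample = synthetic_excel_with_embedded_word()
    # The buffer carries a Workbook entry, yet the Word magic still matches.
    print(matches_word_magic(sample))
```

Because the check looks only for the string within a byte range, it cannot tell a top-level WordDocument stream from one that belongs to an embedded document; a detector that reads the container's directory structure can.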

I vote for removing that magic. I looked at some of my files, and the "WordDocument" and "Workbook" strings occur at various offsets. The presence of embedded documents makes detection by those strings unreliable.

Attachments

tika-806-ver2.patch (5 kB), 09/Dec/11 15:57, Antoni Mylka
tika-806-ver3.zip (4 kB), 12/Dec/11 15:33, Antoni Mylka

People

Assignee: Antoni Mylka
Reporter: Antoni Mylka
Votes: 0
Watchers: 0

Dates

Created: 09/Dec/11 15:05
Updated: 13/Dec/11 13:36
Resolved: 13/Dec/11 13:36
