[TIKA-786] Tika CLI --detect returns incorrect content-type for files with altered extensions - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.1
Fix Version/s: 1.1
Component/s: cli
Labels: None
Environment: Windows

Description

From a discussion on the user mailing list on Nov. 11, 2011, where it was requested that the following be filed as a new bug:

Tika CLI returns an incorrect content type when called with --detect on files whose extension has been changed (and nothing else). MS Word (.doc) documents renamed to .xls or .ppt are detected as Excel or PowerPoint documents, whereas the --metadata option determines the content type correctly (as application/msword) from the actual contents of these misnamed files. The same occurs with other types of MS Office 2003 documents, and could possibly occur with a wide range of document types.

To quote Nick B., from the user mailing list: "If you look at the TestMediaTypes class you'll see what you can get with just the mime magic and filenames, and then there's TestContainerAwareDetector which shows the correct detection happening by using the extra detectors available".
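The gap between the two results can be illustrated with a small standalone sketch. This is not Tika code and does not use Tika's API; it is written in Python for brevity, and the function names and lookup tables are hypothetical. It assumes that Office 97-2003 files share the same OLE2 header bytes and differ in the stream names stored inside the container, which is why a filename or header check alone cannot tell a renamed .doc from a real .xls. The substring search below is a crude stand-in for properly parsing the container's directory.

```python
import os

# Header bytes shared by OLE2 compound files (.doc, .xls, .ppt from Office 97-2003).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical table standing in for filename-based detection.
EXTENSION_TYPES = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
}

# Stream names found inside the container, stored there as UTF-16LE text.
STREAM_TYPES = {
    "WordDocument": "application/msword",
    "Workbook": "application/vnd.ms-excel",
    "PowerPoint Document": "application/vnd.ms-powerpoint",
}


def detect_by_name(filename):
    """Guess the type from the extension only (what a renamed file fools)."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TYPES.get(ext, "application/octet-stream")


def detect_by_content(data):
    """Guess the type from the bytes: header first, then a stream-name search.

    Assumption: a plain substring search is used here for illustration; a real
    container-aware detector would parse the OLE2 directory entries instead.
    """
    if not data.startswith(OLE2_MAGIC):
        return "application/octet-stream"
    for stream_name, media_type in STREAM_TYPES.items():
        if stream_name.encode("utf-16-le") in data:
            return media_type
    # OLE2 header present, but no recognised stream name was found.
    return "unknown OLE2 document"


# A fabricated byte string that mimics a Word file saved as "report.xls".
fake_doc = OLE2_MAGIC + bytes(504) + "WordDocument".encode("utf-16-le")
print(detect_by_name("report.xls"))
print(detect_by_content(fake_doc))
```

The name-based guess follows the extension, while the content-based guess follows the stream name, which mirrors the mismatch reported between --detect and --metadata.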

People

Assignee: Unassigned

Reporter: John Mastarone

Votes: 0

Watchers: 0

Dates

Created: 21/Nov/11 02:42

Updated: 21/Nov/11 13:16

Resolved: 21/Nov/11 13:16