[TIKA-2311] Preserve "x-tika-ooxml" mime value for truncated ooxml files - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.15, 2.0.0
Component/s: None
Labels: None

Description

The following is an unintended consequence of TIKA-2212.

The OOXML parser used to handle x-tika-ooxml. We have some truncated ooxml files in our regression corpus. The previous behavior was:

1) ZipPackage detector caught the zip truncation exception and returned "application/zip"
2) The mime detector recognized magic and returned x-tika-ooxml
3) The file was then routed to the OOXML parser, which did little with the content because it hit the zip exception early on, but the final mime type was x-tika-ooxml.

The current behavior is:
1) Same detection steps
2) However, because the OOXML parser no longer handles x-tika-ooxml, the file is handled by the Package Parser, which overwrites the magic-determined mime type, and the new mime type is application/zip.
3) Some content is extracted because the Package parser handles the zip entries in order and only throws the exception once it hits the last entry in the zip file.
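
To illustrate point 3, here is a minimal sketch of why reading zip entries in order yields partial content from a truncated file. It uses the JDK's `java.util.zip` streaming reader as a stand-in; it is not the Package parser's code, and the class name `TruncatedZipSketch`, the entry names, and the sizes are made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class TruncatedZipSketch {

    /** Entries read completely, plus the exception (if any) that stopped the walk. */
    static final class Result {
        final List<String> completeEntries = new ArrayList<>();
        IOException failure; // stays null when the whole stream was readable
    }

    /** Builds an in-memory zip whose entries hold pseudo-random (hard to compress) bytes. */
    static byte[] buildZip(int entryCount, int bytesPerEntry) {
        Random random = new Random(42); // fixed seed keeps the demo repeatable
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < entryCount; i++) {
                byte[] payload = new byte[bytesPerEntry];
                random.nextBytes(payload);
                zip.putNextEntry(new ZipEntry("part" + i + ".bin"));
                zip.write(payload);
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Streams entries front to back, keeping whatever was read before any failure. */
    static Result readInOrder(byte[] zipBytes) {
        Result result = new Result();
        byte[] buffer = new byte[4096];
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                while (zip.read(buffer) != -1) {
                    // drain the entry; a real parser would hand these bytes on
                }
                result.completeEntries.add(entry.getName());
            }
        } catch (IOException e) {
            result.failure = e; // truncation surfaces here, after earlier entries succeeded
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] full = buildZip(3, 1000);
        byte[] truncated = Arrays.copyOf(full, full.length / 2); // cut inside a later entry
        Result result = readInOrder(truncated);
        System.out.println("entries read before failure: " + result.completeEntries);
        System.out.println("failure: " + result.failure);
    }
}
```

A reader that needs the zip's central directory, which sits at the end of the file, would instead fail before yielding any entry, which is consistent with the OOXML parser extracting almost nothing from the same files.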

Ideally, I'd like to keep the magic-determined mime detection. Once we can chain parsers, the user should be able to back off to the PackageParser, but I don't think this should be the default behavior.

One solution would be to create a new mime type that is not the parent of the other ooxml subtypes, but is itself a leaf subtype, something like: x-tika-ooxml-unk.
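
A rough model of how such a leaf type could change routing is sketched below. It is plain Java with maps, not Tika's classes, and it assumes that parser lookup falls back from a type to its supertype when no parser claims the type, which is what the behavior described above implies. The class `ParserRoutingSketch`, the full type string `application/x-tika-ooxml-unk`, and its placement under x-tika-ooxml are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

class ParserRoutingSketch {
    static final String ZIP = "application/zip";
    static final String OOXML = "application/x-tika-ooxml";
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    // Hypothetical leaf type from the proposal; the exact name is an assumption.
    static final String OOXML_UNK = "application/x-tika-ooxml-unk";

    private final Map<String, String> parentOf = new HashMap<>();  // child type -> parent type
    private final Map<String, String> parserFor = new HashMap<>(); // type -> parser name

    void declareSubtype(String child, String parent) {
        parentOf.put(child, parent);
    }

    void register(String type, String parserName) {
        parserFor.put(type, parserName);
    }

    /** Walks up the type hierarchy until some parser claims the type. */
    String resolve(String detectedType) {
        String type = detectedType;
        while (type != null) {
            String parser = parserFor.get(type);
            if (parser != null) {
                return parser;
            }
            type = parentOf.get(type);
        }
        return "none";
    }

    /** Routing as described under "current behavior": nothing claims x-tika-ooxml itself. */
    static ParserRoutingSketch currentBehavior() {
        ParserRoutingSketch routing = new ParserRoutingSketch();
        routing.declareSubtype(OOXML, ZIP);
        routing.declareSubtype(DOCX, OOXML);
        routing.register(ZIP, "PackageParser");
        routing.register(DOCX, "OOXMLParser");
        return routing;
    }

    /** Same routing plus the proposed leaf type, claimed by the OOXML parser. */
    static ParserRoutingSketch withLeafType() {
        ParserRoutingSketch routing = currentBehavior();
        routing.declareSubtype(OOXML_UNK, OOXML);
        routing.register(OOXML_UNK, "OOXMLParser");
        return routing;
    }

    public static void main(String[] args) {
        System.out.println(currentBehavior().resolve(OOXML));
        System.out.println(withLeafType().resolve(OOXML_UNK));
    }
}
```

In this model the leaf type only helps if detection reports it for files whose specific OOXML subtype cannot be determined; because it has no children, claiming it in the OOXML parser does not affect how the known subtypes are routed.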

Any objections/other recommendations?

Attachments

2ANEZKVHPKXC4VYR2HKYUYRVWFLCQTXI
01/May/17 14:02
24 kB
Tim Allison

Issue Links

relates to: TIKA-2483 Using PackageParser in ForkParser causes NPE (Resolved)

People

Assignee: Unassigned
Reporter: Tim Allison
Votes: 0
Watchers: 4

Dates

Created: 27/Mar/17 11:30
Updated: 12/Apr/21 13:02
Resolved: 01/May/17 19:23