[TIKA-704] PDF and Outlook docs embedded in MS Word documents not parsed - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9
Fix Version/s: 0.10
Component/s: parser
Labels: None
Environment: Windows 7 64-bit

Description

There currently appear to be issues with embedded PDFs and Outlook .msg files contained in MS Word documents. I'll attach a sample for each, along with my recursive parser (in case the problem lies there).

From what I can see, when these embedded objects are parsed, they are initially identified as vnd.openxmlformats-officedocument.oleObject in the metadata's Content-Type field. After the recursive parser's call to super.parse, the Content-Type values change to the following:

PDFs: application/vnd.ms-works
.MSG: application/x-tika-msoffice

The internal AutoDetectParser is unable to identify these PDFs correctly and therefore does not call the PDFParser on them.
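One possible reading of the behaviour described above, offered as a minimal sketch and not as a confirmed diagnosis, is that the embedded file sits inside an OLE2 container, so content sniffing sees the container's header instead of the PDF's own `%PDF-` signature. The class `MagicSniffer` and its methods are hypothetical helpers written for illustration; they are not part of Tika. The generic label returned for OLE2 data simply reuses the type name quoted in this report.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: classifies a byte prefix by its magic number.
final class MagicSniffer {
    // Every OLE2 compound file starts with these eight bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // A bare PDF file starts with the ASCII text "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true when data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Maps the leading bytes of a stream to a media type string.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, OLE2_MAGIC)) {
            // Generic OLE2 label, taken from the Content-Type quoted in this report.
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] barePdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        // Assumption for illustration: the embedded object is an OLE2 container,
        // so its first bytes are the OLE2 header and not the PDF signature.
        byte[] wrapped = new byte[512];
        System.arraycopy(OLE2_MAGIC, 0, wrapped, 0, OLE2_MAGIC.length);

        System.out.println(sniff(barePdf));
        System.out.println(sniff(wrapped));
    }
}
```

Under that assumption, a detector that looks only at the leading bytes would never route the wrapped stream to a PDF parser; the payload would first have to be extracted from the container and sniffed again.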

Attachments

- LicensedTestWithOutlook.docx (07/Sep/11 13:20, 111 kB, Jeremy Anderson)
- LicensedTestWithPdf.docx (07/Sep/11 13:20, 3.80 MB, Jeremy Anderson)
- recursiveUsage.txt (01/Sep/11 17:36, 3 kB, Jeremy Anderson)
- TestWithOutlook.docx (01/Sep/11 17:36, 3.76 MB, Jeremy Anderson)
- TestWithPdf.docx (01/Sep/11 17:36, 3.80 MB, Jeremy Anderson)

People

Assignee: Jukka Zitting

Reporter: Jeremy Anderson

Votes: 0

Watchers: 1

Dates

Created: 01/Sep/11 17:34

Updated: 20/Oct/11 12:34

Resolved: 02/Sep/11 15:17