TIKA-704

PDF and Outlook docs embedded in MS Word documents not parsed

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Windows 7 64-bit

      Description

      Currently there appear to be issues with embedded PDFs and Outlook .msg files contained in MS Word documents. I'll attach a sample for each, along with my recursive parser (in case the problem lies there).

      From what I see, when these embedded objects are parsed, they are initially identified as vnd.openxmlformats-officedocument.oleObject in the metadata's Content-Type field. After the recursive parser calls its superclass's parse method, the Content-Type values change to the following:

      PDFs: application/vnd.ms-works
      .MSG: application/x-tika-msoffice

      The internal AutoDetectParser is unable to properly identify these PDFs and therefore does not call the PDFParser on them.
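      One plausible reading of these symptoms (an assumption, not something confirmed in this report) is that Word stores each embedded object inside an OLE2 compound container, so a detector looking at the first bytes sees the container's signature rather than the PDF's own "%PDF-" header. The sketch below illustrates that idea with plain magic-number checks. The class EmbeddedSniffer and its methods are hypothetical names made up for illustration; they are not part of Tika, and the returned type strings are simply the labels used in this report.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: classifies a byte prefix by magic number.
final class EmbeddedSniffer {

    // Signature that opens every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // A PDF file begins with the ASCII text "%PDF-".
    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
            && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Returns a media type label based only on the leading bytes.
    static String sniff(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(data, OLE2_MAGIC)) {
            // Generic OLE2 container: the real payload is hidden inside it.
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] barePdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        // Simulate a PDF wrapped in an OLE2 container: container signature first.
        byte[] wrapped = new byte[OLE2_MAGIC.length + barePdf.length];
        System.arraycopy(OLE2_MAGIC, 0, wrapped, 0, OLE2_MAGIC.length);
        System.arraycopy(barePdf, 0, wrapped, OLE2_MAGIC.length, barePdf.length);

        System.out.println("bare PDF:    " + sniff(barePdf));
        System.out.println("wrapped PDF: " + sniff(wrapped));
    }
}
```

      If that assumption holds, the embedded payload would have to be unwrapped from its container before type detection could hand it to the PDF parser.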

      1. LicensedTestWithPdf.docx
        3.80 MB
        Jeremy Anderson
      2. LicensedTestWithOutlook.docx
        111 kB
        Jeremy Anderson
      3. TestWithPdf.docx
        3.80 MB
        Jeremy Anderson
      4. TestWithOutlook.docx
        3.76 MB
        Jeremy Anderson
      5. recursiveUsage.txt
        3 kB
        Jeremy Anderson

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Jeremy Anderson
          • Votes:
            0
            Watchers:
            1