Tika: TIKA-704

PDF and Outlook docs embedded in MS Word documents not parsed

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Windows 7 64-bit

      Description

      Embedded PDFs and Outlook .msg files contained in MS Word documents are currently not parsed correctly. I'll attach a sample for each, along with my recursive parser (in case the problem lies there).

      From what I see, when these embedded objects are parsed, they're initially identified as vnd.openxmlformats-officedocument.oleObject in the metadata's Content-Type field. After the recursive parser calls its superclass's parse method, the Content-Type values change to the following:

      PDFs: application/vnd.ms-works
      .MSG: application/x-tika-msoffice

      The internal AutoDetectParser is unable to identify these PDFs properly and therefore never calls the PDFParser on them.
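
One plausible reading of the behaviour described above is that the embedded object is stored inside an OLE2 wrapper, so type detection based on leading bytes sees the wrapper rather than the PDF inside it. The sketch below illustrates that idea with plain Java and no Tika classes. `MagicSniffer` and its methods are hypothetical names, the mapping of the OLE2 signature to application/x-tika-msoffice is an assumption based on the type reported above, and the "wrapped" sample is only the OLE2 signature followed by PDF bytes, not a valid compound file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): guesses a media type from leading magic bytes. */
final class MagicSniffer {
    // Signature that opens every OLE2 compound file.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Signature that opens a PDF file.
    static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if data begins with the given prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Guesses a type from the first bytes only, without looking inside containers. */
    static String sniff(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(data, OLE2_MAGIC)) {
            // Assumed label for a generic OLE2 container.
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] barePdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        // Simplified stand-in for a wrapped PDF: OLE2 signature, then the PDF bytes.
        byte[] wrapped = new byte[OLE2_MAGIC.length + barePdf.length];
        System.arraycopy(OLE2_MAGIC, 0, wrapped, 0, OLE2_MAGIC.length);
        System.arraycopy(barePdf, 0, wrapped, OLE2_MAGIC.length, barePdf.length);

        System.out.println(sniff(barePdf));
        System.out.println(sniff(wrapped));
    }
}
```

Under these assumptions the bare PDF is recognised as a PDF, while the wrapped one is reported as a generic container. A detector would need to open the container and inspect its contents before a PDF parser could be selected.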

      1. recursiveUsage.txt
        3 kB
        Jeremy Anderson
      2. TestWithOutlook.docx
        3.76 MB
        Jeremy Anderson
      3. TestWithPdf.docx
        3.80 MB
        Jeremy Anderson
      4. LicensedTestWithOutlook.docx
        111 kB
        Jeremy Anderson
      5. LicensedTestWithPdf.docx
        3.80 MB
        Jeremy Anderson

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Jeremy Anderson
          • Votes:
            0
            Watchers:
            1
