  1. Tika
  2. TIKA-704

PDF and Outlook docs embedded in MS Word documents not parsed

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels: None
    • Environment: Windows 7 64-bit

    Description

      Currently there appear to be issues with embedded PDFs and Outlook MSG files contained in MS Word documents. I'll attach a sample for each, along with my recursive parser (in case the problem lies in there).

      From what I see, when these embedded objects are parsed, they're initially identified as vnd.openxmlformats-officedocument.oleObject in the metadata's Content-Type field. After the recursive parser calls its superclass's parse method, the Content-Type values change to the following:

      PDFs: application/vnd.ms-works
      .MSG: application/x-tika-msoffice

      The internal AutoDetectParser is unable to properly identify these PDFs and therefore does not call the PDFParser on them.
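
      One possible explanation, sketched below, is that leading-byte detection sees an OLE2 wrapper rather than the PDF inside it. This is an illustration only, not Tika's detector: the class MagicSniffer and its methods are hypothetical names, and the assumption that the embedded part is an OLE2 container whose payload starts later in the stream has not been confirmed against the attached samples.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: classifies data by its leading bytes only.
final class MagicSniffer {
    // Signature found in the first 8 bytes of an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ASCII "%PDF-", the signature at the start of a bare PDF file.
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};

    // True when data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns a media type based solely on the leading bytes.
    // An OLE2-wrapped PDF matches the OLE2 branch, never the PDF branch,
    // because the "%PDF-" marker is not at offset 0 (assumption about the wrapper).
    static String sniff(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }
}
```

      Under these assumptions, MagicSniffer.sniff returns application/pdf only for a buffer that begins with %PDF-. A buffer that begins with the OLE2 signature is reported as application/x-tika-msoffice even if PDF bytes follow, so the payload would have to be unwrapped before a PDF parser could be selected.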

      Attachments

        1. TestWithPdf.docx
          3.80 MB
          Jeremy Anderson
        2. TestWithOutlook.docx
          3.76 MB
          Jeremy Anderson
        3. recursiveUsage.txt
          3 kB
          Jeremy Anderson
        4. LicensedTestWithPdf.docx
          3.80 MB
          Jeremy Anderson
        5. LicensedTestWithOutlook.docx
          111 kB
          Jeremy Anderson

        Activity

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Jeremy Anderson (rpialum)