TIKA-704

PDF and Outlook docs embedded in MS Word documents not parsed

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Windows 7 64-bit

      Description

      Currently there appear to be issues with embedded PDFs and Outlook .msg files contained in MS Word documents. I'll attach a sample of each, along with my recursive parser (in case the problem lies there).

      From what I see, when these embedded objects are parsed, they're initially identified as vnd.openxmlformats-officedocument.oleObject in the metadata's Content-Type field. After the recursive parser calls its superclass's parse method, the Content-Type values update to the following:

      PDFs: application/vnd.ms-works
      .MSG: application/x-tika-msoffice

      The internal AutoDetectParser is unable to properly identify these PDFs and therefore does not call the PDFParser on them.
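
      One possible reading of the behaviour described above can be sketched with plain magic-byte checks. This is an illustration only, not Tika's detection code: the class and method names are hypothetical, and it assumes (the report does not confirm this) that the embedded PDF is stored inside an OLE2 wrapper, so the PDF signature is not at the start of the embedded stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Tika.
class EmbeddedMagicSniffer {
    // Signature that opens an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // A PDF file starts with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Classifies a stream by its leading bytes only.
    static String sniff(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(data, OLE2_MAGIC)) {
            // Generic OLE2 type, as reported for the .MSG case above.
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    // Simulates a PDF wrapped in an OLE2 container (assumed layout):
    // the OLE2 signature comes first, the PDF bytes follow later.
    static byte[] wrapInOle2(byte[] payload) {
        byte[] wrapped = new byte[OLE2_MAGIC.length + 16 + payload.length];
        System.arraycopy(OLE2_MAGIC, 0, wrapped, 0, OLE2_MAGIC.length);
        System.arraycopy(payload, 0, wrapped, OLE2_MAGIC.length + 16, payload.length);
        return wrapped;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("bare PDF:    " + sniff(pdf));
        System.out.println("wrapped PDF: " + sniff(wrapInOle2(pdf)));
    }
}
```

      Under that assumption, a detector that looks only at the leading bytes reports the wrapper type rather than application/pdf, which would be consistent with the PDFParser never being invoked.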

      1. TestWithPdf.docx
        3.80 MB
        Jeremy Anderson
      2. TestWithOutlook.docx
        3.76 MB
        Jeremy Anderson
      3. recursiveUsage.txt
        3 kB
        Jeremy Anderson
      4. LicensedTestWithPdf.docx
        3.80 MB
        Jeremy Anderson
      5. LicensedTestWithOutlook.docx
        111 kB
        Jeremy Anderson

        Activity

        Jeremy Anderson created issue -
        Jeremy Anderson made changes -
        Field Original Value New Value
        Attachment recursiveUsage.txt [ 12492633 ]
        Attachment TestWithOutlook.docx [ 12492634 ]
        Attachment TestWithPdf.docx [ 12492635 ]
        Jukka Zitting made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Jukka Zitting [ jukkaz ]
        Fix Version/s 1.0 [ 12313535 ]
        Resolution Fixed [ 1 ]
        Jeremy Anderson made changes -
        Attachment LicensedTestWithOutlook.docx [ 12493309 ]
        Attachment LicensedTestWithPdf.docx [ 12493310 ]
        Jukka Zitting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Jeremy Anderson
          • Votes:
            0
            Watchers:
            1
