Tika / TIKA-982

RTF document embedded into Word (.doc) document is extracted as .unknown

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels: None

    Attachments

      1. TIKA-982.patch
        5 kB
        Michael McCandless

      Activity

        Michael McCandless (mikemccand) added a comment:

        Patch with test case and fix. It adds another case to
        POIFSContainerDetector.detect that looks for Contents and
        \u0003ObjInfo and returns COMP_OBJ (so we just extract the binary blob
        from Contents). I also had to fix
        AbstractPOIFSExtractor.handleEmbeddedOfficeDoc to try to open Contents
        if CONTENTS is not found.
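
        The two changes described in the comment can be sketched roughly as
        follows. This is a simplified illustration, not the patch itself: the
        class EmbeddedContentsSketch, the Kind enum, and the use of a plain
        Set/Map in place of a POIFS directory are all assumptions made for the
        example. Only the entry names (Contents, CONTENTS, \u0003ObjInfo) and
        the COMP_OBJ result come from the comment.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/**
 * Hypothetical stand-in for the logic described in the comment.
 * It does not use the real Tika or POI classes; an embedded OLE
 * container is modelled as a map from entry name to raw bytes.
 */
class EmbeddedContentsSketch {

    // Entry names taken from the comment.
    static final String OBJ_INFO = "\u0003ObjInfo";
    static final String CONTENTS_UPPER = "CONTENTS";
    static final String CONTENTS_MIXED = "Contents";

    /** Illustrative result labels; only COMP_OBJ is named in the comment. */
    enum Kind { COMP_OBJ, UNKNOWN }

    /**
     * Extra detection case: a container holding both "Contents" and
     * "\u0003ObjInfo" is treated as COMP_OBJ, so that the caller
     * extracts the binary blob stored in "Contents".
     */
    static Kind detect(Set<String> entryNames) {
        if (entryNames.contains(CONTENTS_MIXED) && entryNames.contains(OBJ_INFO)) {
            return Kind.COMP_OBJ;
        }
        return Kind.UNKNOWN;
    }

    /**
     * Lookup with fallback: try "CONTENTS" first, then "Contents".
     * Returns an empty Optional when neither entry exists.
     */
    static Optional<byte[]> openContents(Map<String, byte[]> entries) {
        byte[] blob = entries.get(CONTENTS_UPPER);
        if (blob == null) {
            blob = entries.get(CONTENTS_MIXED);
        }
        return Optional.ofNullable(blob);
    }
}
```

        In this sketch, a container with entries Contents and \u0003ObjInfo
        is detected as COMP_OBJ, and openContents then returns the bytes
        stored under Contents even though no CONTENTS entry exists.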

        People

          Assignee: Michael McCandless (mikemccand)
          Reporter: Michael McCandless (mikemccand)
          Votes: 0
          Watchers: 1
