Tika / TIKA-1010

Embedded documents in RTF are not extracted

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      When an RTF document embeds another document, the embedded object
      looks like this:

      {\object\objemb
      \objw628\objh765{\*\objclass Package}{\*\objdata 0105000002000000080000005061636b61676500000000000000000066000000
      020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000000030022000000433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b00000048656c6c6f20576f726c64000001050000050000000d0000004d45544146494c455049435400
      54040000bbfaffffee0000000800540445050000
      0100090000037300000002001c0000000000050000000b0200000000050000000c02320029001c000000fb02f5ff000000000000900100000001000000005461686f6d61000055170a7000fc070058b1f37761b1f3772040f57749366683040000002d01000005000000090200000000050000000102ffffff0005000000
      020101000000050000002e0106000000090000002105060048772e747874210015001c000000fb021000070000000000bc02000000000102022253797374656d00004936668300000a0026008a0100000000ffffffff8cfc0700040000002d010100030000000000}
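The \objdata payload above is hex text that the RTF writer has split across lines. As a minimal sketch of the first step (turning that text back into raw bytes), the Python below strips the whitespace and decodes the digits. The function name `decode_objdata_hex` is hypothetical, and the sample is only the first 41 bytes of the payload shown above.

```python
def decode_objdata_hex(hex_text: str) -> bytes:
    """Decode the hex digits of an \\objdata group into raw bytes.

    Whitespace (including the line breaks the RTF writer inserts) is
    dropped before decoding. This is an illustrative helper, not Tika code.
    """
    compact = "".join(hex_text.split())
    if len(compact) % 2 != 0:
        raise ValueError("odd number of hex digits in objdata")
    return bytes.fromhex(compact)


# First line of the payload from the description, plus the start of the
# second line, with the line break kept to show that it is ignored.
SAMPLE = """0105000002000000080000005061636b61676500000000000000000066000000
020048772e74787400"""

data = decode_objdata_hex(SAMPLE)
print(len(data))       # 41
print(data[12:19])     # b'Package'
```

The decoded bytes are what the hex dump further down in the description displays.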
      

      But, unfortunately, the format of those hex bytes is not spelled out
      in the RTF spec ... the spec merely says the bytes are saved by the
      OLESaveToStream function ... and I haven't been able to find a
      description of what the bytes mean.

      In this case they are a "Package object" (\objclass Package), which I
      think is an [old?] way to wrap any non-OLE file (this is just a .txt
      file).
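One possible reading of the first 32 bytes, offered as a sketch rather than a confirmed answer: they look like a sequence of little-endian 32-bit integers and length-prefixed strings, and the class name "Package" appears inside the data itself. The field names below (`version`, `format_id`, `class_name`, `topic_name`, `item_name`, `native_size`) are assumptions borrowed from how OLE 1.0 embedded-object headers are usually described; nothing in this issue confirms them, and the helper names are hypothetical.

```python
import struct

# The first 32 bytes of the \objdata payload from the description.
OBJDATA_PREFIX = bytes.fromhex(
    "0105000002000000080000005061636b61676500000000000000000066000000"
)


def read_u32(buf, pos):
    """Read a little-endian unsigned 32-bit integer; return (value, new_pos)."""
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4


def read_prefixed_string(buf, pos):
    """Read a 4-byte length, then that many bytes of NUL-terminated text."""
    length, pos = read_u32(buf, pos)
    raw = buf[pos:pos + length]
    return raw.rstrip(b"\x00").decode("cp1252"), pos + length


def parse_object_header(buf):
    """Parse the assumed header layout; field names are guesses."""
    pos = 0
    version, pos = read_u32(buf, pos)
    format_id, pos = read_u32(buf, pos)
    class_name, pos = read_prefixed_string(buf, pos)
    topic_name, pos = read_prefixed_string(buf, pos)
    item_name, pos = read_prefixed_string(buf, pos)
    native_size, pos = read_u32(buf, pos)
    return {
        "version": version,
        "format_id": format_id,
        "class_name": class_name,
        "topic_name": topic_name,
        "item_name": item_name,
        "native_size": native_size,
        "data_offset": pos,  # where the wrapped data would begin
    }


header = parse_object_header(OBJDATA_PREFIX)
print(header["class_name"], hex(header["native_size"]), header["data_offset"])
# Package 0x66 32
```

Under this reading, 0x66 (102) bytes of wrapped data follow at offset 0x20, which ends exactly where a second 01 05 00 00 marker and the METAFILEPICT string begin in the dump.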

      Here's the hex dump:

      00000000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
      00000010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
      00000020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
      00000030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
      00000040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
      00000050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
      00000060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
      00000070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
      00000080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
      00000090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
      000000a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
      000000b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
      000000c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
      000000d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
      000000e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
      000000f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
      00000100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
      00000110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
      00000120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
      00000130  00 02 01 01 00 00 00 05  00 00 00 2e 01 06 00 00  |................|
      00000140  00 09 00 00 00 21 05 06  00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
      00000150  00 15 00 1c 00 00 00 fb  02 10 00 07 00 00 00 00  |................|
      00000160  00 bc 02 00 00 00 00 01  02 02 22 53 79 73 74 65  |.........."Syste|
      00000170  6d 00 00 49 36 66 83 00  00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
      00000180  00 00 00 ff ff ff ff 8c  fc 07 00 04 00 00 00 2d  |...............-|
      00000190  01 01 00 03 00 00 00 00  00                       |.........|
      00000199
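The 102 bytes at offsets 0x20 to 0x85 of the dump can be walked with a similarly tentative sketch. The layout assumed here (a 2-byte value, two NUL-terminated strings, two more 16-bit values, a length-prefixed path, then the length-prefixed file contents) is inferred only from this one sample; the meanings of the skipped 16-bit values are not decoded, and all names are hypothetical.

```python
import struct

# The path that appears twice in the dump, as hex copied from the description.
PATH_HEX = "433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874"

# Bytes 0x20..0x85 of the hex dump, rebuilt piece by piece.
NATIVE_DATA = bytes.fromhex("".join([
    "0200",                    # 2-byte value, meaning assumed unknown
    "48772e74787400",          # "Hw.txt" + NUL
    PATH_HEX, "00",            # first path + NUL
    "0000", "0300",            # two 16-bit values, not decoded here
    "22000000",                # length of the next string (34)
    PATH_HEX, "00",            # second path + NUL
    "0b000000",                # length of the file contents (11)
    "48656c6c6f20576f726c64",  # the file contents
    "0000",                    # trailing bytes, not decoded here
]))


def read_cstring(buf, pos):
    """Read a NUL-terminated string; return (text, position after the NUL)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("cp1252"), end + 1


def parse_package(buf):
    """Walk the assumed Package layout and pull out the wrapped file."""
    pos = 2                                   # skip the leading 2-byte value
    label, pos = read_cstring(buf, pos)
    source_path, pos = read_cstring(buf, pos)
    pos += 4                                  # skip two undecoded 16-bit values
    (path_len,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    second_path = buf[pos:pos + path_len].rstrip(b"\x00").decode("cp1252")
    pos += path_len
    (data_len,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    return label, source_path, second_path, buf[pos:pos + data_len]


label, source_path, second_path, contents = parse_package(NATIVE_DATA)
print(label, contents)   # Hw.txt b'Hello World'
```

If this layout holds for other Package objects, the embedded file name and its bytes can be recovered without any OLE library, but that generalisation is untested here.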
      

      Anyway I have no idea how to decode the bytes at this point ... just
      opening the issue in case anyone else does!

        Attachments

        1. outer.rtf
          97 kB
          Tim Allison
        2. ExampleRTFs.zip
          209 kB
          Chris Bamford
        3. testRTFRegularImages.rtf
          95 kB
          Tim Allison
        4. testRTF_embbededFiles.zip
          245 kB
          Tim Allison
        5. TIKA-1010_patch.zip
          329 kB
          Tim Allison
        6. TIKA-1010.patch
          49 kB
          Tim Allison
        7. xls_attachment_example.zip
          23 kB
          Tim Allison

              People

              • Assignee: Tim Allison (tallison)
              • Reporter: Michael McCandless (mikemccand)
              • Votes: 0
              • Watchers: 7
