Tika / TIKA-3388

Ole10Native attachments with non-ASCII filenames extracted with garbled names

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.26
    • Fix Version/s: 2.7.0
    • Component/s: None
    • Labels: None

    Description

      I've encountered some Word files containing Ole10Native embedded objects that Tika extracts with strange filenames. It looks like the attachments were originally named with Chinese and other non-ASCII Unicode characters, and the filename Tika returns is a cp1252 interpretation of the original UTF-8-encoded filename.

      Looking closer at the Ole10Native stream of these files, it does seem like there is a UTF-8 version of the filename stored, as well as a UTF-16 version of the filename stored later on after the actual attachment data. I believe POI is returning this first UTF-8 version of the filename interpreted as if it were ANSI / cp1252.

      A possible solution would be for Apache POI to read and return the UTF-16 filename when it is present. Alternatively, Tika could check whether the currently returned "ANSI" name is actually valid UTF-8.

      Attached is a sample file I made, in which a .msg file named "約翰的測試文件🖖.msg" is embedded in a .docx file. Tika currently extracts the attachment with the filename "ç´„ç¿°çš„æ¸¬è©¦æ–‡ä»¶ðŸ––.msg".
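
      A minimal sketch of the second option (checking whether the "ANSI" name is really UTF-8), assuming the garbled name was produced by decoding the raw bytes as windows-1252. The class and method names are hypothetical and not part of Tika or POI. The idea is to re-encode the name as windows-1252 to recover the original bytes, then decode those bytes strictly as UTF-8, keeping the original name if either step fails.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing Tika or POI class.
class Ole10NameRepair {

    // Assumption: the name was originally decoded as windows-1252 (cp1252).
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Returns the UTF-8 reinterpretation of ansiName when its cp1252 bytes
     * form valid UTF-8; otherwise returns ansiName unchanged.
     */
    static String repairIfUtf8(String ansiName) {
        try {
            // Step 1: recover the raw bytes. Fails if the name contains a
            // character that cp1252 cannot represent.
            ByteBuffer raw = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(ansiName));
            // Step 2: strict UTF-8 decode. Fails on any malformed sequence,
            // which is what a genuine cp1252 name with accents will produce.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(raw)
                    .toString();
        } catch (CharacterCodingException e) {
            return ansiName; // not recoverable as UTF-8, keep the name as-is
        }
    }
}
```

      Pure ASCII names pass through unchanged, since ASCII is identical in both encodings. One limitation of this approach: a name whose UTF-8 bytes include a value that windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90 or 0x9D) may already have lost information in the first decode, in which case the sketch simply returns the garbled name.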

      Regarding the Ole10Native data stream, I can't find any official documentation for its structure, but the three extra UTF-16 string properties I'm seeing at the end appear to use the following format:

      • The strings are not null-terminated, but are instead preceded by a 4-byte string length value. Note that this value is the number of 16-bit code units in the UTF-16 string, not the byte length.
      • The order of the 3 strings is temporary path, filename, original path. This differs from the order of the normal ANSI / UTF-8 strings near the beginning of the Ole10Native stream which is filename, original path, temporary path.
      • I'm assuming these wide variants of these strings are optional and may not be present.
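
      The layout described above could be read roughly as follows. This is only a sketch under several assumptions that the observations above do not confirm: the 4-byte length is little-endian, the strings are UTF-16LE, and the caller already knows the offset at which the trailing block starts (immediately after the attachment data). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical reader, not an existing Tika or POI class.
class Ole10WideNames {

    /**
     * Reads the three trailing UTF-16 strings starting at offset.
     * Returns {temporaryPath, fileName, originalPath}, or null when the
     * block is absent or truncated (the wide strings appear to be optional).
     */
    static String[] readWideStrings(byte[] stream, int offset) {
        if (offset < 0 || offset > stream.length) {
            return null;
        }
        // Assumption: little-endian length fields.
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(offset);

        String[] result = new String[3];
        for (int i = 0; i < result.length; i++) {
            if (buf.remaining() < 4) {
                return null; // no room for a length prefix
            }
            // The prefix counts 16-bit code units, so the byte length is double.
            long codeUnits = buf.getInt() & 0xFFFFFFFFL;
            long byteLength = codeUnits * 2;
            if (byteLength > buf.remaining()) {
                return null; // length prefix points past the end of the stream
            }
            byte[] raw = new byte[(int) byteLength];
            buf.get(raw);
            // Assumption: UTF-16LE, no terminator to strip.
            result[i] = new String(raw, StandardCharsets.UTF_16LE);
        }
        return result;
    }
}
```

      Index 1 of the returned array would be the filename, since the wide strings are ordered temporary path, filename, original path. Returning null rather than throwing lets a caller fall back to the existing ANSI name when the wide strings are missing.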

          People

            Assignee: Unassigned
            Reporter: Ross Johnson (rossj)
            Votes: 0
            Watchers: 3
