[TIKA-3388] Ole10Native attachments with non-ASCII filenames extracted with garbled names - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.26
Fix Version/s: 2.7.0
Component/s: None
Labels: None

Description

I've encountered some Word files containing Ole10Native embedded objects that Tika extracts with garbled filenames. It looks like the attachments were originally named with Chinese and other non-ASCII Unicode characters, and the filename Tika returns is a cp1252 interpretation of the original UTF-8-encoded filename.

Looking closer at the Ole10Native stream of these files, it does seem like there is a UTF-8 version of the filename stored, as well as a UTF-16 version of the filename stored later on after the actual attachment data. I believe POI is returning this first UTF-8 version of the filename interpreted as if it were ANSI / cp1252.

A possible solution would be for Apache POI to read and return the provided UTF-16 filename if it is present. Alternatively, Tika could check the currently returned "ANSI" name to see if it might actually be valid UTF-8.
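To illustrate the second option, here is a minimal sketch of such a check. `AnsiNameRepair` and `repairIfUtf8` are hypothetical names, not part of Tika or POI. The sketch assumes the returned name was produced by mapping each byte to one char, so that encoding it as ISO-8859-1 recovers the original bytes exactly; if the name really was decoded as cp1252, some byte values have no mapping and the round trip may not be exact.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika or POI.
final class AnsiNameRepair {

    /**
     * Returns the UTF-8 reading of a name that was decoded one byte per char,
     * or the name unchanged if its bytes are not well-formed UTF-8.
     */
    static String repairIfUtf8(String name) {
        // Chars above U+00FF cannot have come from a byte-per-char decode.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(name)) {
            return name;
        }
        // Assumption: ISO-8859-1 recovers the original bytes exactly.
        byte[] raw = name.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so treat it as genuine single-byte text.
            return name;
        }
    }
}
```

A check like this can misfire on a real single-byte name whose bytes happen to form valid UTF-8, so preferring the UTF-16 name when it is present is probably the safer route.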

Attached is a sample file I made which has a .msg file with name "約翰的測試文件🖖.msg" embedded in a .docx file. Tika currently extracts the attachment with filename "ç´ç¿°çæ¸¬è©¦æä»¶ð.msg".

–

Regarding the Ole10Native data stream, I can't find any official documentation for its structure, but the three extra UTF-16 string properties I'm seeing at the end appear to use the following format:

The strings are not null-terminated, but instead are preceded by a 4-byte string length value. Note that this value is the number of 16-bit code units in the UTF-16 string and not the byte length.
The order of the 3 strings is temporary path, filename, original path. This differs from the order of the normal ANSI / UTF-8 strings near the beginning of the Ole10Native stream which is filename, original path, temporary path.
I'm assuming these wide variants of these strings are optional and may not be present.
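The layout described above could be read with something like the following sketch. `WideOleNames` is a hypothetical class, not POI's API, and the input is assumed to be only the bytes that follow the attachment data. Little-endian byte order for both the length field and the UTF-16 text is also an assumption, based on what is typical for OLE structures.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for the trailing wide strings; not POI's API.
final class WideOleNames {
    final String tempPath;
    final String fileName;
    final String originalPath;

    private WideOleNames(String tempPath, String fileName, String originalPath) {
        this.tempPath = tempPath;
        this.fileName = fileName;
        this.originalPath = originalPath;
    }

    /**
     * Parses the bytes that follow the attachment data. Returns null when
     * the three strings are absent or truncated, since they seem optional.
     */
    static WideOleNames parse(byte[] trailer) {
        // Assumption: little-endian length fields.
        ByteBuffer buf = ByteBuffer.wrap(trailer).order(ByteOrder.LITTLE_ENDIAN);
        // Order observed in the sample: temporary path, filename, original path.
        String temp = readString(buf);
        String file = readString(buf);
        String orig = readString(buf);
        if (temp == null || file == null || orig == null) {
            return null;
        }
        return new WideOleNames(temp, file, orig);
    }

    private static String readString(ByteBuffer buf) {
        if (buf.remaining() < 4) {
            return null;
        }
        // The length counts 16-bit code units, not bytes.
        int units = buf.getInt();
        if (units < 0 || 2L * units > buf.remaining()) {
            return null;
        }
        byte[] bytes = new byte[2 * units];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_16LE);
    }
}
```

Returning null for a missing or short trailer lets a caller fall back to the name from the start of the stream.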

Attachments

Ole10Native att with Unicode name.docx
08/May/21 00:39
16 kB
Ross Johnson

People

Assignee: Unassigned
Reporter: Ross Johnson
Votes: 0
Watchers: 3

Dates

Created: 08/May/21 00:41
Updated: 06/Feb/23 22:41
Resolved: 06/Feb/23 22:41