[TIKA-721] UTF16-LE not detected - ASF JIRA


Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: parser
Labels: None

Description

I have a test file encoded in UTF16-LE, but Tika fails to detect it.

Note that the file is missing the BOM, which strictly is not allowed for
little-endian data (without a BOM, UTF-16 is assumed to be big-endian, so
the BOM is optional only for UTF16-BE).
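
For reference, a minimal Python 3 sketch of the BOM check that cannot succeed on this file. The function name `bom_encoding` is hypothetical and is not part of Tika; the sketch also ignores UTF-32, whose little-endian BOM begins with the same two bytes.

```python
import codecs

def bom_encoding(data):
    """Return the UTF-16 codec named by a leading BOM, or None if there is no BOM."""
    # Note: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE; this sketch
    # deliberately ignores UTF-32 and only distinguishes the two UTF-16 orders.
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return 'utf_16_le'
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return 'utf_16_be'
    return None
```

A file like the attached one, which starts directly with text, yields None here, so a detector has to fall back to statistical guessing.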

Not sure we can realistically fix this; I have no idea how...

Here's what Tika detects:

windows-1250:   confidence=9
windows-1250:   confidence=7
windows-1252:   confidence=7
windows-1252:   confidence=6
windows-1252:   confidence=5
IBM420_ltr:     confidence=4
windows-1252:   confidence=3
windows-1254:   confidence=2
windows-1250:   confidence=2
windows-1252:   confidence=2
IBM420_rtl:     confidence=1
windows-1253:   confidence=1
windows-1250:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1
windows-1252:   confidence=1

The test file decodes fine as UTF16-LE; e.g. in Python just run this:

import codecs
# Open in binary mode so the decoder receives the raw bytes.
with open('Chinese_Simplified_utf16.txt', 'rb') as f:
    text, consumed = codecs.getdecoder('utf_16_le')(f.read())
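
One possible way to guess the byte order of BOM-less UTF-16 is sketched below in Python 3. This is only an illustrative heuristic, not what Tika does and not the contents of the attached patch; the function name `guess_utf16_byte_order` and the choice of whitespace characters are assumptions. It relies on the fact that ASCII whitespace encoded in UTF-16 has one zero byte, and the position of that zero byte reveals the byte order even in mostly-CJK text.

```python
def guess_utf16_byte_order(data):
    """Guess the byte order of BOM-less UTF-16 bytes; return a codec name or None."""
    # Hypothetical heuristic: space, LF, CR and tab are U+0020, U+000A, U+000D
    # and U+0009, so one byte of each 2-byte code unit is zero.
    whitespace = {0x20, 0x0A, 0x0D, 0x09}
    le_hits = be_hits = 0
    # Walk the data one 2-byte code unit at a time.
    for i in range(0, len(data) - 1, 2):
        first, second = data[i], data[i + 1]
        if second == 0 and first in whitespace:
            le_hits += 1      # low byte first: looks little-endian
        elif first == 0 and second in whitespace:
            be_hits += 1      # high byte first: looks big-endian
    if le_hits > be_hits:
        return 'utf_16_le'
    if be_hits > le_hits:
        return 'utf_16_be'
    return None               # no evidence either way
```

The result is only a hint: text with no ASCII whitespace gives None, and a real detector would need to weigh this against the confidence of the other candidate charsets.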

Attachments


Chinese_Simplified_utf16.txt    19/Sep/11 16:40    28 kB    Michael McCandless
TIKA-721.patch                  02/Oct/11 16:20    14 kB    Michael McCandless

Issue Links

is duplicated by

TIKA-729 TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings (Resolved)

is related to

TIKA-2038 A more accurate facility for detecting Charset Encoding of HTML documents (Open)

relates to

TIKA-2484 Improve CharsetDetector to recognize UTF-16LE/BE,UTF-32LE/BE and UTF-7 with/without BOMs correctly (Open)


People

Assignee: Michael McCandless
Reporter: Michael McCandless
Votes: 1
Watchers: 2

Dates

Created: 19/Sep/11 16:40
Updated: 27/Oct/17 12:07