Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
Description
I have a test file encoded in UTF-16LE, but Tika fails to detect it.
Note that the file has no BOM, which strictly speaking is not allowed
for little-endian data (for UTF-16BE the BOM is optional, since UTF-16
without a BOM is assumed to be big-endian).
Not sure we can realistically fix this; I have no idea how.
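For reference, checking whether a file starts with a UTF-16 BOM is just a prefix comparison. A minimal Python sketch; the helper name `utf16_bom` is made up for illustration and is not part of Tika:

```python
import codecs


def utf16_bom(data):
    """Return the UTF-16 codec implied by a leading BOM, or None if absent.

    Hypothetical helper for illustration only. Caveat: the UTF-32LE BOM
    (FF FE 00 00) also starts with the UTF-16LE BOM bytes, which this
    sketch does not try to tell apart.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf_16_le"
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf_16_be"
    return None
```

For a file like the one described here, such a check returns None, so the byte order has to be inferred from the content itself.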
Here's what Tika detects:
windows-1250: confidence=9
windows-1250: confidence=7
windows-1252: confidence=7
windows-1252: confidence=6
windows-1252: confidence=5
IBM420_ltr: confidence=4
windows-1252: confidence=3
windows-1254: confidence=2
windows-1250: confidence=2
windows-1252: confidence=2
IBM420_rtl: confidence=1
windows-1253: confidence=1
windows-1250: confidence=1
windows-1252: confidence=1
windows-1252: confidence=1
windows-1252: confidence=1
windows-1252: confidence=1
windows-1252: confidence=1
The test file decodes fine as UTF-16LE; e.g. in Python just run this:
import codecs
# Read the raw bytes; the decoder returns (decoded text, bytes consumed)
codecs.getdecoder('utf_16_le')(open('Chinese_Simplified_utf16.txt', 'rb').read())
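One possible direction, sketched here under heavy assumptions: decode the bytes with both byte orders and prefer the one that yields fewer unassigned, private-use, or control code points. With the wrong byte order a newline (U+000A) turns into U+0A00, which is unassigned, and byte-swapped CJK characters can land in unassigned, private-use, or surrogate ranges. This only illustrates the idea; it is not how Tika's CharsetDetector works, the function names are hypothetical, and the heuristic has not been tried on the attached file.

```python
import unicodedata


def plausibility(data, encoding):
    """Fraction of decoded characters that look like ordinary text."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        # e.g. lone surrogates produced by reading with the wrong byte order
        return 0.0
    if not text:
        return 0.0
    good = 0
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in ("Cn", "Co"):  # unassigned or private use
            continue
        if cat == "Cc" and ch not in "\t\r\n":  # control chars other than whitespace
            continue
        good += 1
    return good / len(text)


def guess_utf16_byte_order(data):
    """Guess 'utf_16_le' or 'utf_16_be' for BOM-less data; None if undecided."""
    if len(data) < 2 or len(data) % 2:
        return None  # UTF-16 data always has an even number of bytes
    le = plausibility(data, "utf_16_le")
    be = plausibility(data, "utf_16_be")
    if le == be:
        return None
    return "utf_16_le" if le > be else "utf_16_be"


# Usage on an in-memory sample (not the attached test file):
sample = "中文测试\n你好，世界。\n".encode("utf_16_le")
print(guess_utf16_byte_order(sample))
```

Short inputs or text without line breaks may score the same under both byte orders, in which case the sketch returns None rather than guessing.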
Attachments
Issue Links
- is duplicated by: TIKA-729 TIKA CharsetDetector not detecting UTF-16BE/UTF-16LE encodings (Resolved)
- is related to: TIKA-2038 A more accurate facility for detecting Charset Encoding of HTML documents (Open)
- relates to: TIKA-2484 Improve CharsetDetector to recognize UTF-16LE/BE,UTF-32LE/BE and UTF-7 with/without BOMs correctly (Open)