Description
Tika produces an empty string as output for UCS-2 and UCS-4 encoded txt files.
No logs or errors are emitted, just an empty string.
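For anyone needing to check such files outside Tika, the following is a minimal sketch that picks a charset from a leading byte order mark and decodes with the plain JDK. It is not Tika's detection logic. The class name `BomSniffer`, the UTF-8 fallback for BOM-less input, and the choice to treat `FF FE 00 00` as UTF-32LE (rather than UTF-16LE followed by a NUL) are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: picks a charset from a leading BOM.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset implied by a leading BOM, or null if no BOM is recognised. */
    static Charset detect(byte[] data) {
        // The 4-byte UTF-32 marks are checked before the 2-byte UTF-16 marks,
        // because FF FE 00 00 (UTF-32LE) also starts with FF FE (UTF-16LE).
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Charset.forName("UTF-32BE");
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Charset.forName("UTF-32LE");
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null;
    }

    /** Decodes using the BOM-implied charset; assumes UTF-8 when no BOM is present. */
    static String decode(byte[] data) {
        Charset cs = detect(data);
        if (cs == null) {
            cs = StandardCharsets.UTF_8; // assumption: fallback for BOM-less input
        }
        String s = new String(data, cs);
        // Drop the decoded BOM character if the decoder left it in place.
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Calling `BomSniffer.decode(bytes)` on the raw contents of a BOM-prefixed UCS-2 or UCS-4 file should return the text without the BOM. Files without a BOM are outside what this sketch can distinguish.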
Other formats are okay, except that UTF-7 files are parsed as UTF-8, which wreaks havoc with non-ASCII characters.
How do I know that? I opened the UTF-7 file as UTF-8 using gedit's open dialog and found the output similar to Tika's.
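The effect can be illustrated without Tika. UTF-7 encodes every non-ASCII character as an ASCII run of the form `+...-`, so the bytes are also valid UTF-8 and decode without any error, just to the wrong text. The sketch below uses a hand-written UTF-7 sample (`caf+AOk-` for "café"), since the standard JDK ships no UTF-7 charset. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what a UTF-8 reader sees in UTF-7 input.
final class Utf7AsUtf8Demo {
    private Utf7AsUtf8Demo() {}

    // "café" written in UTF-7: é (U+00E9) becomes the base64 run "+AOk-".
    static final String UTF7_CAFE = "caf+AOk-";

    /** Decodes the bytes as UTF-8, which is the misreading described above. */
    static String decodeAsUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    /** True if every byte is 7-bit ASCII, so UTF-8 decoding cannot fail on it. */
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) return false; // high bit set means non-ASCII
        }
        return true;
    }
}
```

Because the input is pure ASCII, a UTF-8 decoder has nothing to reject: the escape run `+AOk-` comes through literally instead of as `é`, which matches the kind of output described for the UTF-7 file.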
I am attaching all four encoded files, along with Tika's output from parsing the UTF-7 file, for reference.
Attachments
Issue Links
- is related to
  - TIKA-2484 Improve CharsetDetector to recognize UTF-16LE/BE,UTF-32LE/BE and UTF-7 with/without BOMs correctly (Open)