Description
The code that demonstrates this bug is in the attachment ChineseTextExtraction.java.
Observed behavior:
Tika.parseToString(InputStream, Metadata) incorrectly detects 'application/octet-stream' for the Content-Type and returns an empty string for the contents.
Expected behavior:
It should detect 'text/plain' for the Content-Type and return a Unicode string of the contents of the file.
Notes:
GB2312.txt is a plain text file containing Chinese text encoded in the GB2312 charset. GB2312 is a very common charset, so Tika should be able to handle it without any problems. In fact, the CharsetDetector class on its own detects the charset as GB18030, which is a superset of GB2312, and CharsetDetector.getString() converts the GB2312 bytes to Unicode just fine. I don't understand why the Tika facade fails.
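To illustrate why a GB18030 detection result is acceptable for a GB2312 file, here is a minimal sketch that uses only the JDK's own charsets and no Tika code. The class name, method name, and sample string are made up for illustration, and it assumes a JDK that ships the extended charsets (GB2312 and GB18030). It encodes a short Chinese string as GB2312 and decodes the same bytes as GB18030:

```java
import java.nio.charset.Charset;

public class Gb2312Superset {
    // Encode the text as GB2312, then decode those bytes as GB18030.
    // Hypothetical helper, not part of Tika.
    static String decodeAsGb18030(String text) {
        byte[] gb2312Bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(gb2312Bytes, Charset.forName("GB18030"));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese (language)".
        String sample = "\u4e2d\u6587";
        // Report whether the text survived the GB2312 -> GB18030 round trip.
        System.out.println(sample.equals(decodeAsGb18030(sample)));
    }
}
```

If the round trip preserves the text, decoding with the detected charset is not the failing step, which is consistent with CharsetDetector.getString() working on its own.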
Edit:
I have the same issue with the file russian-koi8-r.txt. KOI8-R is also a common charset, so this doesn't appear to be just a GB2312 issue. ISO-8859-1 (English) files seem to work fine.
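One observable difference between the failing and working files is the byte content. English text in ISO-8859-1 is plain 7-bit ASCII, while Russian text in KOI8-R consists of bytes with the high bit set. The sketch below only demonstrates that difference using JDK charsets. It does not establish that this is the cause of the misdetection, and the class name, method name, and sample strings are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HighByteCount {
    // Count bytes with the high bit set (values 0x80 to 0xFF).
    static int countHighBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // English sample encoded as ISO-8859-1.
        byte[] english = "hello".getBytes(StandardCharsets.ISO_8859_1);
        // Russian sample (the word for "hello") encoded as KOI8-R.
        byte[] russian = "\u043f\u0440\u0438\u0432\u0435\u0442"
                .getBytes(Charset.forName("KOI8-R"));

        System.out.println("ISO-8859-1 high bytes: " + countHighBytes(english));
        System.out.println("KOI8-R high bytes: " + countHighBytes(russian));
    }
}
```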