Affects Version/s: 1.2
Fix Version/s: None
The attached document extracts correctly in Tika 1.1 but incorrectly in Tika 1.2.
The difference appears to be that Tika 1.1 honors the meta http-equiv Content-Type tag, which specifies the charset as iso-8859-6, and correctly converts the output to UTF-8.
Tika 1.2 appears to ignore the charset specified in the meta tag.
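
To illustrate the symptom outside of Tika, here is a minimal Python sketch of what honoring versus ignoring the meta charset does to the text. It does not call Tika or reproduce its detection logic. The sample page, the `declared_charset` helper, and the ISO-8859-1 fallback are assumptions made up for illustration; the attached document is not used.

```python
import re

# Hypothetical sample: a tiny HTML page whose body is Arabic text
# encoded as ISO-8859-6 (this is NOT the attached document).
ARABIC = "مرحبا"
raw = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-6"></head><body>'
    + ARABIC.encode("iso-8859-6")
    + b"</body></html>"
)


def declared_charset(data: bytes, default: str = "iso-8859-1") -> str:
    """Return the charset named in the first 1024 bytes, or the default.

    Hypothetical helper: a crude regex scan, not Tika's detection logic.
    """
    match = re.search(rb'charset=["\']?([A-Za-z0-9_\-]+)', data[:1024])
    return match.group(1).decode("ascii") if match else default


# Honoring the meta tag: decode with the declared charset, then emit UTF-8.
honored = raw.decode(declared_charset(raw))
utf8_out = honored.encode("utf-8")

# Ignoring the meta tag: decode with an assumed fallback charset instead.
ignored = raw.decode("iso-8859-1")

print(ARABIC in honored)  # Arabic text survives when the charset is honored
print(ARABIC in ignored)  # Arabic text is lost when the charset is ignored
```

In the second case the Arabic bytes are read as Latin-1 characters, which is the kind of garbage output described in this report.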
Some experimentation suggests that the problem lies in the charset handling.
It doesn't matter which mode Tika is used in (server, app mode, etc.): even if the Content-Type is specified with a charset, the output is still garbage.