Details
-
New Feature
-
Status: Closed
-
Major
-
Resolution: Fixed
-
0.1-incubating
-
None
Description
Currently, the text parser implementation uses the default encoding of the Java runtime when instantiating a Reader for the passed input stream. We need to support other encodings as well.
It would be helpful to support the specification of an encoding in the parse method.
Ideally, Tika would also provide the ability to determine the encoding automatically based on the data stream. (Unicode files may have byte order marks (http://unicode.org/faq/utf_bom.html#BOM), but I don't know if other encodings can be inferred from content.)