Description
When tika-app is run with the -t (plain text) flag, multi-byte characters in the output are replaced with "?". The same file processed with the -x (XHTML) flag comes out intact.
Example:
$ java -jar tika-app-0.4.jar -t ./test.txt
I?t?rn?ti?n?liz?ti?n
$ java -jar tika-app-0.4.jar -x ./test.txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
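The ticket does not state the root cause. One possible explanation, and only an assumption here, is that the -t output is written through an encoding that cannot represent these characters (for example a platform default such as US-ASCII), while the -x output is written as UTF-8. The following minimal sketch reproduces the symptom outside of Tika; the class EncodingDemo and the helper roundTrip are hypothetical names used for illustration and are not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Hypothetical helper (not a Tika API): writes text through an
    // OutputStreamWriter using the given charset, then decodes the
    // resulting bytes with the same charset.
    public static String roundTrip(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Same text as in test.txt, written with escapes so the result does
        // not depend on the encoding of this source file.
        String text = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n";

        // Assumed failure mode: the charset cannot map the accented
        // characters, so the writer substitutes '?' for each of them.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // I?t?rn?ti?n?liz?ti?n

        // With UTF-8 every character survives the round trip.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

If this is indeed the cause, writing the text output with an explicit UTF-8 writer instead of the platform default would avoid the substitution.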
Attachments
Issue Links
- relates to TIKA-3515 Tika CLI -t should use UTF-8 as default output encoding (Resolved)