Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.1
- Fix Version/s: None
- Component/s: None
Description
Hello,
I have Czech HTML files that contain a meta tag with the correct encoding (windows-1250), but Tika ignores it and detects ISO-8859-2, and sometimes even ISO-8859-9 (Turkish), which causes diacritics to be processed incorrectly.
Shouldn't it rather respect what the HTML meta tag declares?
HTML file header:
<meta http-equiv=Content-Type content="text/html; charset=windows-1250">
Tika detected metadata:
Content-Encoding: ISO-8859-2
Content-Type: text/html; charset=windows-1250
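To illustrate why the mismatch matters, here is a minimal Java sketch. It is not Tika code: the class `MetaCharsetSketch` and its methods `declaredCharset` and `misdecode` are hypothetical names made up for this example, and the regular expression is a simplification rather than a full HTML parser. It reads the charset declared in the meta tag, then shows that bytes written as windows-1250 do not survive being decoded as ISO-8859-2.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is not how Tika detects encodings.
class MetaCharsetSketch {

    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared in the first matching meta tag, if any.
    public static Optional<String> declaredCharset(String htmlHead) {
        Matcher m = META_CHARSET.matcher(htmlHead);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Encodes text with the actual charset, then decodes it with the assumed one,
    // which is what happens when a parser picks the wrong encoding.
    public static String misdecode(String text, String actual, String assumed) {
        byte[] bytes = text.getBytes(Charset.forName(actual));
        return new String(bytes, Charset.forName(assumed));
    }

    public static void main(String[] args) {
        String head =
                "<meta http-equiv=Content-Type content=\"text/html; charset=windows-1250\">";
        String declared = declaredCharset(head).orElse("UTF-8");
        System.out.println("Declared charset: " + declared);

        // The Czech word "zlutoucky" with its diacritics, written as Unicode escapes
        // so the example does not depend on the source file encoding.
        String czech = "\u017elu\u0165ou\u010dk\u00fd";
        System.out.println("Intact with declared charset: "
                + czech.equals(misdecode(czech, declared, declared)));
        System.out.println("Intact when read as ISO-8859-2: "
                + czech.equals(misdecode(czech, declared, "ISO-8859-2")));
    }
}
```

The two encodings place letters such as š, ž and ť at different byte values, so text decoded with the detected ISO-8859-2 instead of the declared windows-1250 loses exactly those characters.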
I am not sure if I am reporting this correctly. I will be happy to provide more information if necessary.
Thanks,
Tomas
Issue Links
- duplicates TIKA-431: Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly. (Resolved)