Affects Version/s: 1.1
Fix Version/s: None
I have Czech HTML files that contain a meta tag with the correct encoding (windows-1250), but Tika ignores it and detects ISO-8859-2, and sometimes even ISO-8859-9 (Turkish), which causes the diacritics to be processed incorrectly.
Shouldn't it rather respect what the HTML meta tag declares?
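To illustrate why the wrong charset breaks the diacritics, here is a minimal sketch using only the JDK (it is not Tika code, and the class name `CharsetMismatchDemo` and the sample letters are made up for illustration). It assumes the JDK in use ships the windows-1250 and ISO-8859-2 charsets. Letters such as š, ž and ť sit at different byte positions in the two encodings, so bytes written as windows-1250 do not decode back to the same text when read as ISO-8859-2.

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {
    // Decode raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The Czech letters s-caron, z-caron, t-caron, written as Unicode
        // escapes so the example does not depend on the source file encoding.
        String original = "\u0161\u017E\u0165";

        // Bytes as they would appear in a file saved as windows-1250.
        byte[] bytes = original.getBytes(Charset.forName("windows-1250"));

        // Decoding with the declared charset versus the wrongly detected one.
        String right = decode(bytes, "windows-1250");
        String wrong = decode(bytes, "ISO-8859-2");

        System.out.println("windows-1250 round-trips: " + right.equals(original));
        System.out.println("ISO-8859-2 round-trips:   " + wrong.equals(original));
    }
}
```

Plain ASCII text decodes identically under both charsets, so the problem only shows up on the accented letters.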
HTML file header:
<meta http-equiv=Content-Type content="text/html; charset=windows-1250">
Tika detected metadata:
Content-Type: text/html; charset=windows-1250
I am not sure if I am reporting this correctly. I will be happy to provide more information if necessary.