Description
HTML 5 allows the character encoding of a page to be specified via:
- <meta charset="...">
- Unicode Byte Order Mark (BOM)
These are allowed in addition to the existing HTTP Content-Type header and the http-equiv Content-Type meta element, see [1].
Parse-html ignores both meta charset and the BOM and falls back to the default encoding (cp1252). Parse-tika sets the encoding correctly.
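
To illustrate the expected behaviour, here is a minimal sketch in Python of a detection order that honours both signals: BOM first, then a meta charset declaration in the first part of the document, then the default. It is not the parse-html or parse-tika code, and it is much simpler than the full HTML5 encoding sniffing algorithm. The function name `sniff_encoding`, the 1024-byte prescan window and the regular expression are assumptions made for this example.

```python
import re

# Byte order marks and the encodings they indicate.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)

# Simplified pattern (assumption): matches <meta charset="..."> and also the
# older <meta http-equiv="Content-Type" content="...; charset=..."> form.
_META_CHARSET = re.compile(
    rb"<meta\s[^>]*?charset\s*=\s*[\"']?\s*([a-zA-Z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_encoding(data: bytes, default: str = "cp1252") -> str:
    """Return a lower-cased encoding label for raw HTML bytes."""
    # 1. A BOM takes precedence over any declaration in the markup.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Look for a charset declaration near the start of the document.
    match = _META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing found: fall back to the default encoding.
    return default
```

With this order, a page that starts with a UTF-8 BOM or contains `<meta charset="utf-8">` would be reported as utf-8 instead of falling through to cp1252.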