Details
Bug
Status: Closed
Minor
Resolution: Fixed
0.9.0
None
None
all
Description
Even after an encoding alias map is defined in StringUtil, HTML is still parsed with the original encoding.
I found that NekoHTML, which HtmlParser uses, reads the charset from the page's meta tag.
We can set its feature "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true
so that NekoHTML uses the encoding we set.
Concretely:
  private DocumentFragment parseNeko(InputSource input) throws Exception {
    DOMFragmentParser parser = new DOMFragmentParser();
    // some plugins, e.g., creativecommons, need to examine html comments
    try {
+     parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
      parser.setFeature("http://apache.org/xml/features/include-comments",
                        true);
      ....
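BTW, the new line must be added at the front of the try block, because the statement that follows it (parser.setFeature("http://apache.org/xml/features/include-comments", true)) throws an exception, so anything placed after that statement inside the try block is never executed.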
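
The ordering point can be illustrated with a minimal, self-contained sketch. It does not use NekoHTML itself: StubParser is a hypothetical stand-in that rejects the include-comments feature, as this report says the real call does, and the empty catch block is an assumption about how the surrounding code handles that exception.

```java
import java.util.ArrayList;
import java.util.List;

public class FeatureOrderSketch {

    static final String IGNORE_CHARSET =
        "http://cyberneko.org/html/features/scanner/ignore-specified-charset";
    static final String INCLUDE_COMMENTS =
        "http://apache.org/xml/features/include-comments";

    /** Hypothetical stand-in for the real parser; NOT the NekoHTML API. */
    static class StubParser {
        final List<String> enabled = new ArrayList<>();

        void setFeature(String name, boolean value) throws Exception {
            // Assumption taken from the report: this feature is rejected.
            if (name.equals(INCLUDE_COMMENTS)) {
                throw new Exception("feature not recognized: " + name);
            }
            if (value) {
                enabled.add(name);
            }
        }
    }

    /** Returns the features that actually got enabled for the given ordering. */
    static List<String> configure(boolean charsetFirst) {
        StubParser parser = new StubParser();
        try {
            if (charsetFirst) {
                parser.setFeature(IGNORE_CHARSET, true);
            }
            parser.setFeature(INCLUDE_COMMENTS, true); // throws in this stub
            if (!charsetFirst) {
                parser.setFeature(IGNORE_CHARSET, true); // never reached
            }
        } catch (Exception e) {
            // Assumed to be swallowed here, so parsing would carry on.
        }
        return parser.enabled;
    }

    public static void main(String[] args) {
        System.out.println(configure(true).contains(IGNORE_CHARSET));  // prints "true"
        System.out.println(configure(false).contains(IGNORE_CHARSET)); // prints "false"
    }
}
```

With the charset feature set first it takes effect; set after the throwing call, it is silently skipped.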