Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
New
Description
The highlighter provides a function to escape HTML, which does to much. To create valid HTML only ", <, >, & must be escaped, everything else can kept unescaped. The escaper unfortunately does also additionally escape everything > 127, which is unneeded if your web site has the correct encoding. It also produces huge amounts of HTML entities if used with eastern languages.
This would not be a bugf if the escaping would be correct, but it isn't, it escapes like that:
result.append("\&#").append((int)ch).append(";");
So it escapes not (as HTML needs) the unicode codepoint, instead it escapes the UTF-16 char, which is incorrect, e.g. for our all-time favourite Deseret:
U+10400 (deseret capital letter long i) would be escaped as �� and not as 𐐀.
So we should remove the stupid encoding of chars > 127 which is simply useless
See also: https://github.com/elasticsearch/elasticsearch/issues/3587