[LUCENE-5191] SimpleHTMLEncoder in Highlighter module breaks Unicode outside BMP - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.5, 6.0
Component/s: modules/highlighter
Labels: None
Lucene Fields: New

Description

The highlighter provides a function to escape HTML, which does too much. To create valid HTML, only ", <, >, and & must be escaped; everything else can be kept unescaped. Unfortunately, the escaper additionally escapes everything > 127, which is unneeded if your web site declares the correct encoding. It also produces huge amounts of HTML entities when used with eastern languages.

This would not be a bug if the escaping were correct, but it isn't. It escapes like this:

result.append("&#").append((int)ch).append(";");

So it does not escape the Unicode code point (as HTML requires); instead it escapes each UTF-16 char, which is incorrect, e.g. for our all-time favourite, Deseret:

U+10400 (DESERET CAPITAL LETTER LONG I) would be escaped as &#55297;&#56320; and not as &#66560; (𐐀).
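
The difference can be reproduced with a small standalone sketch. The class and method names below are hypothetical and exist only for illustration; `escapePerChar` imitates the per-char line quoted above, while `escapePerCodePoint` shows what a numeric character reference would need if chars > 127 were escaped at all.

```java
final class EscapeDemo {
    // Imitates the quoted approach: one entity per UTF-16 code unit (char).
    static String escapePerChar(String s) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch > 127) {
                result.append("&#").append((int) ch).append(";");
            } else {
                result.append(ch);
            }
        }
        return result.toString();
    }

    // Walks the string by Unicode code point, so a surrogate pair
    // becomes a single numeric character reference.
    static String escapePerCodePoint(String s) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 127) {
                result.append("&#").append(cp).append(";");
            } else {
                result.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String deseret = new String(Character.toChars(0x10400));
        System.out.println(escapePerChar(deseret));      // &#55297;&#56320;
        System.out.println(escapePerCodePoint(deseret)); // &#66560;
    }
}
```

The two surrogate values are 0xD801 and 0xDC00, which are not valid characters on their own, so the first output is not a correct HTML representation of U+10400.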

So we should remove the stupid encoding of chars > 127, which is simply useless.
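
A minimal sketch of what such a reduced escaper might look like follows. It is an illustration of the proposal, not the attached patch; the class name `MinimalHtmlEscaper` is hypothetical, and the set of escaped characters is taken only from the description above.

```java
final class MinimalHtmlEscaper {
    // Escapes only ", &, < and >; every other char is copied unchanged,
    // so surrogate pairs stay intact and non-ASCII text is left to the
    // page's declared encoding.
    static String htmlEncode(String plainText) {
        if (plainText == null || plainText.isEmpty()) {
            return "";
        }
        StringBuilder result = new StringBuilder(plainText.length());
        for (int i = 0; i < plainText.length(); i++) {
            char ch = plainText.charAt(i);
            switch (ch) {
                case '"': result.append("&quot;"); break;
                case '&': result.append("&amp;"); break;
                case '<': result.append("&lt;"); break;
                case '>': result.append("&gt;"); break;
                default:  result.append(ch);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(htmlEncode("<b>T&C</b>")); // &lt;b&gt;T&amp;C&lt;/b&gt;
    }
}
```

Because both halves of a surrogate pair fall through to the default branch, characters outside the BMP pass through without any special handling.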

See also: https://github.com/elasticsearch/elasticsearch/issues/3587

Attachments

- LUCENE-5191.patch (2 kB), Uwe Schindler, 29/Aug/13 21:55
- LUCENE-5191.patch (0.7 kB), Uwe Schindler, 28/Aug/13 23:09

People

Assignee: Uwe Schindler

Reporter: Uwe Schindler

Votes: 0

Watchers: 3

Dates

Created: 28/Aug/13 23:04

Updated: 28/Aug/22 13:52

Resolved: 29/Aug/13 22:10
