Lucene - Core
  1. Lucene - Core
  2. LUCENE-5191

SimpleHTMLEncoder in Highlighter module breaks Unicode outside BMP

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The highlighter provides a function to escape HTML, which does to much. To create valid HTML only ", <, >, & must be escaped, everything else can kept unescaped. The escaper unfortunately does also additionally escape everything > 127, which is unneeded if your web site has the correct encoding. It also produces huge amounts of HTML entities if used with eastern languages.

      This would not be a bugf if the escaping would be correct, but it isn't, it escapes like that:

      result.append("\&#").append((int)ch).append(";");

      So it escapes not (as HTML needs) the unicode codepoint, instead it escapes the UTF-16 char, which is incorrect, e.g. for our all-time favourite Deseret:

      U+10400 (deseret capital letter long i) would be escaped as &#55297;&#56320; and not as &#66560;.

      So we should remove the stupid encoding of chars > 127 which is simply useless

      See also: https://github.com/elasticsearch/elasticsearch/issues/3587

      1. LUCENE-5191.patch
        2 kB
        Uwe Schindler
      2. LUCENE-5191.patch
        0.7 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Simple patch.

        Show
        Uwe Schindler added a comment - Simple patch.
        Hide
        Walter Underwood added a comment -

        A different fix would be to iterate on codepoints instead of characters. That would fix the botch with high and low surrogates.

        Show
        Walter Underwood added a comment - A different fix would be to iterate on codepoints instead of characters. That would fix the botch with high and low surrogates.
        Hide
        Uwe Schindler added a comment - - edited

        Hi Walter,

        I agree this could be used to fix, but its not useful here! There is no need to escape codepoints > 127. It just produces huge junks of escapes for all eastern languages! Escaping chars > 127 was done in the 1990s when web pages were not able to use other charsets than ISO-8859-1 or US-ASCII (and HTTP version 0.9 was not binary safe).

        Show
        Uwe Schindler added a comment - - edited Hi Walter, I agree this could be used to fix, but its not useful here! There is no need to escape codepoints > 127. It just produces huge junks of escapes for all eastern languages! Escaping chars > 127 was done in the 1990s when web pages were not able to use other charsets than ISO-8859-1 or US-ASCII (and HTTP version 0.9 was not binary safe).
        Hide
        Uwe Schindler added a comment -

        We have a variant of this code, recently added by Robert Muir into PostingsHighlighter's DefaultPassageFormatter.

        This escapes a little bit more chars, with a reference to OWASP: https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.231_-_HTML_Escape_Before_Inserting_Untrusted_Data_into_HTML_Element_Content and https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.232_-_Attribute_Escape_Before_Inserting_Untrusted_Data_into_HTML_Common_Attributes

        The code used here escapes any charis >127 and <255 according to the second rule, which is not needed here, because the escaped data is not included into HTML attributes which may be "unquoted". So for this only the first rule applies, in which it is enough to escape the 4 well-known escapes and also the forward slash + single quote ('). The latter two ones do not need to be escaped if used in text, but for safety we could include them.

        In any case I would like to unify the different approaches of HTML escaping. As we are not working in unquoted attributes (we just encode floating HTML text), I would use Robert's code without the extra numeric escapes.

        The official HTML4 spec (I used HTML4, the passage is the same for other HTML, see http://www.w3.org/TR/REC-html40/charset.html#h-5.3.2):

        Four character entity references deserve special mention since they are frequently used to escape special characters:

        "<" represents the < sign.
        ">" represents the > sign.
        "&" represents the & sign.
        "" represents the " mark.
        Authors wishing to put the "<" character in text should use "<" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use ">" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

        Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values.

        Some authors use the character entity reference """ to encode instances of the double quote mark (") since that character may be used to delimit attribute values.

        Any comments?

        Show
        Uwe Schindler added a comment - We have a variant of this code, recently added by Robert Muir into PostingsHighlighter's DefaultPassageFormatter. This escapes a little bit more chars, with a reference to OWASP: https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.231_-_HTML_Escape_Before_Inserting_Untrusted_Data_into_HTML_Element_Content and https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.232_-_Attribute_Escape_Before_Inserting_Untrusted_Data_into_HTML_Common_Attributes The code used here escapes any charis >127 and <255 according to the second rule, which is not needed here, because the escaped data is not included into HTML attributes which may be "unquoted". So for this only the first rule applies, in which it is enough to escape the 4 well-known escapes and also the forward slash + single quote ('). The latter two ones do not need to be escaped if used in text, but for safety we could include them. In any case I would like to unify the different approaches of HTML escaping. As we are not working in unquoted attributes (we just encode floating HTML text), I would use Robert's code without the extra numeric escapes. The official HTML4 spec (I used HTML4, the passage is the same for other HTML, see http://www.w3.org/TR/REC-html40/charset.html#h-5.3.2 ): Four character entity references deserve special mention since they are frequently used to escape special characters: "<" represents the < sign. ">" represents the > sign. "&" represents the & sign. "" represents the " mark. Authors wishing to put the "<" character in text should use "<" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use ">" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values. Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values. Some authors use the character entity reference """ to encode instances of the double quote mark (") since that character may be used to delimit attribute values. Any comments?
        Hide
        Robert Muir added a comment -

        As we are not working in unquoted attributes

        You cannot make this determination. If you want to copy this method and put a less secure version in SimpleHTMLEncoder, thats cool with me.

        But don't make PostingsHighlighter less secure: -1 to that.

        Show
        Robert Muir added a comment - As we are not working in unquoted attributes You cannot make this determination. If you want to copy this method and put a less secure version in SimpleHTMLEncoder, thats cool with me. But don't make PostingsHighlighter less secure: -1 to that.
        Hide
        Uwe Schindler added a comment -

        I did not want to modify yours although I disagree.

        I will commit the current patch and remove the useless extra encoding.

        Show
        Uwe Schindler added a comment - I did not want to modify yours although I disagree. I will commit the current patch and remove the useless extra encoding.
        Hide
        Uwe Schindler added a comment -

        Attached is a new patch also escaping the single ' and the forwards slash (although the latter is not really required, but I did this to make Robert happy). I refuse to encode the Latin1 chars.

        I will commit this in a minute.

        Show
        Uwe Schindler added a comment - Attached is a new patch also escaping the single ' and the forwards slash (although the latter is not really required, but I did this to make Robert happy). I refuse to encode the Latin1 chars. I will commit this in a minute.
        Hide
        ASF subversion and git services added a comment -

        Commit 1518839 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1518839 ]

        LUCENE-5191: Fix Unicode corrumption in HTML escaping of Standard Highlighter and Fast Vector Highlighter.

        Show
        ASF subversion and git services added a comment - Commit 1518839 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1518839 ] LUCENE-5191 : Fix Unicode corrumption in HTML escaping of Standard Highlighter and Fast Vector Highlighter.
        Hide
        ASF subversion and git services added a comment -

        Commit 1518840 from Uwe Schindler in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1518840 ]

        Merged revision(s) 1518839 from lucene/dev/trunk:
        LUCENE-5191: Fix Unicode corrumption in HTML escaping of Standard Highlighter and Fast Vector Highlighter.

        Show
        ASF subversion and git services added a comment - Commit 1518840 from Uwe Schindler in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1518840 ] Merged revision(s) 1518839 from lucene/dev/trunk: LUCENE-5191 : Fix Unicode corrumption in HTML escaping of Standard Highlighter and Fast Vector Highlighter.
        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development