  1. Commons Lang
  2. LANG-617

StringEscapeUtils.escapeXML() can't process UTF-16 supplementary characters

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 3.0
    • Component/s: lang.*
    • Labels: None

    Description

      Supplementary characters are those whose code points are above 0xFFFF. In UTF-16 each of them is encoded as a surrogate pair, that is, it requires two Java chars instead of one, as explained here: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
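
      To make the definition concrete, here is a minimal sketch using only standard java.lang calls. The sample character (U+1D11E, MUSICAL SYMBOL G CLEF) and the class name are my own choices for illustration, not part of the report.

```java
class SupplementaryCharDemo {
    // Assumed sample input: U+1D11E MUSICAL SYMBOL G CLEF, a supplementary character.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    public static void main(String[] args) {
        // One character as a reader sees it, but two Java chars (a surrogate pair).
        System.out.println(CLEF.length());                          // 2
        System.out.println(CLEF.codePointCount(0, CLEF.length()));  // 1

        // charAt() returns only one half of the pair; codePointAt() returns the whole code point.
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0)));   // true
        System.out.println(Integer.toHexString(CLEF.codePointAt(0)));    // 1d11e
    }
}
```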

      Currently, StringEscapeUtils.escapeXml() isn't aware of this encoding scheme and treats each char as one character, so the two halves of a surrogate pair are processed separately instead of as a single code point.
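
      The effect can be reproduced without the library. The sketch below is a hypothetical stand-in (escapeCharByChar is not the Commons Lang source); it assumes a char-based escaper that writes a decimal numeric reference for every char above 0x7F and ignores named entities entirely.

```java
class CharByCharEscapeDemo {
    // Hypothetical stand-in for a char-based escaper; NOT the Commons Lang implementation.
    static String escapeCharByChar(String str) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i); // sees one UTF-16 code unit at a time
            if (c > 0x7F) {
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        // Each surrogate half becomes its own reference instead of one &#119070;
        System.out.println(escapeCharByChar("a" + clef)); // a&#55348;&#56606;
    }
}
```

      The two references are the surrogate values 0xD834 and 0xDD1E, which do not denote characters on their own.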

      A possible solution in class Entities would be:

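      // Requires java.io.IOException and java.io.Writer.
      public void escape(Writer writer, String str) throws IOException {
          int len = str.length();
          for (int i = 0; i < len; i++) {
              // Read a whole code point rather than a single char.
              int code = str.codePointAt(i);
              String entityName = this.entityName(code);
              if (entityName != null) {
                  writer.write('&');
                  writer.write(entityName);
                  writer.write(';');
              } else if (code > 0x7F) {
                  writer.write("&#");
                  // Write the decimal value; Writer.write(int) would emit a single char.
                  writer.write(Integer.toString(code, 10));
                  writer.write(';');
              } else {
                  writer.write((char) code);
              }
              if (code > 0xffff) {
                  // Supplementary character: skip its low surrogate.
                  i++;
              }
          }
      }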
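
      For anyone who wants to try the idea outside the library, here is a self-contained sketch of the same approach. It is an illustration under assumptions, not a patch: the class name and the entityName lookup (limited to the five predefined XML entities) are hypothetical, and the loop advances with Character.charCount() instead of the explicit i++ used above.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

class CodePointEscapeSketch {
    // Hypothetical lookup covering only the five predefined XML entities.
    static String entityName(int codePoint) {
        switch (codePoint) {
            case '&': return "amp";
            case '<': return "lt";
            case '>': return "gt";
            case '"': return "quot";
            case '\'': return "apos";
            default: return null;
        }
    }

    static void escape(Writer writer, String str) throws IOException {
        int i = 0;
        while (i < str.length()) {
            int code = str.codePointAt(i); // full code point, even for surrogate pairs
            String name = entityName(code);
            if (name != null) {
                writer.write('&');
                writer.write(name);
                writer.write(';');
            } else if (code > 0x7F) {
                writer.write("&#");
                writer.write(Integer.toString(code)); // decimal numeric reference
                writer.write(';');
            } else {
                writer.write(code); // plain ASCII, fits in one char
            }
            i += Character.charCount(code); // 1 for BMP, 2 for supplementary characters
        }
    }

    // Convenience wrapper so the result can be inspected as a String.
    static String escape(String str) {
        StringWriter writer = new StringWriter();
        try {
            escape(writer, str);
        } catch (IOException e) {
            throw new IllegalStateException(e); // StringWriter does not actually throw
        }
        return writer.toString();
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(escape("<a>" + clef + "\u00e9")); // &lt;a&gt;&#119070;&#233;
    }
}
```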

      Because the change is made in Entities, it fixes escapeXml() and also affects the HTML escaping functions. That is probably a good thing, but please note that I have only tested escapeXml().

          People

            Assignee: Unassigned
            Reporter: David Garcia (dav.garcia)