Commons Lang / LANG-480

StringEscapeUtils.escapeHtml incorrectly converts Unicode characters above U+00FFFF into 2 characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 3.0
    • Component/s: lang.*
    • Labels: None
    • Environment: doesn't matter

      Description

      Characters that Java represents internally as two chars (a UTF-16 surrogate pair) are converted incorrectly by this function: each half of the pair is escaped as its own numeric entity. The following test demonstrates the problem:

      import org.apache.commons.lang.*;

      public class J2 {
          public static void main(String[] args) throws Exception {
              // This is the UTF-8 representation of the Unicode character
              // COUNTING ROD UNIT DIGIT THREE (code point U+1D362).
              byte[] data = new byte[] { (byte)0xF0, (byte)0x9D, (byte)0x8D, (byte)0xA2 };

              // output is: &#55348;&#57186;
              // should be: &#119650;
              System.out.println("'" + StringEscapeUtils.escapeHtml(new String(data, "UTF8")) + "'");
          }
      }

      Should be very quick to fix, feel free to drop me an email if you want a patch.
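
For illustration only, here is a minimal sketch of the idea behind such a fix: walk the string by code point instead of by char, so that a surrogate pair is emitted as a single numeric entity. The class and method names (CodePointEscapeSketch, escapeNonAscii) are hypothetical, and this is not the code from the attached patch or from Commons Lang. It only handles numeric entities for code points above 0x7F and ignores named entities and markup characters such as < and &.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the Commons Lang implementation.
final class CodePointEscapeSketch {

    // Escapes every code point above 0x7F as a decimal numeric entity,
    // iterating by code point so a surrogate pair yields one entity.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (codePoint > 0x7F) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.append((char) codePoint);
            }
            // Advance by 1 char for BMP characters, 2 for supplementary ones.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // UTF-8 bytes of U+1D362, as in the report above.
        byte[] data = { (byte) 0xF0, (byte) 0x9D, (byte) 0x8D, (byte) 0xA2 };
        String s = new String(data, StandardCharsets.UTF_8);

        System.out.println(s.length());          // 2 (one surrogate pair)
        System.out.println((int) s.charAt(0));   // 55348 (high surrogate)
        System.out.println((int) s.charAt(1));   // 57186 (low surrogate)
        System.out.println(escapeNonAscii(s));   // &#119650;
    }
}
```

The two char values printed by main are the ones that show up in the faulty output, which is consistent with the escaping loop looking at individual chars rather than code points.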

      1. lang-480.patch
        1 kB
        Alexander Kjäll


          People

          • Assignee: Unassigned
          • Reporter: Alexander Kjäll
