Commons Lang
  1. Commons Lang
  2. LANG-480

StringEscapeUtils.escapeHtml incorrectly converts unicode characters above U+00FFFF into 2 characters

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 3.0
    • Component/s: lang.*
    • Labels:
      None
    • Environment:

      doesn't matter

      Description

      Characters that are represented as a 2 characters internaly by java are incorrectly converted by the function. The following test displays the problem quite nicely:

      import org.apache.commons.lang.*;

      public class J2 {
      public static void main(String[] args) throws Exception {
      // this is the utf8 representation of the character:
      // COUNTING ROD UNIT DIGIT THREE
      // in unicode
      // codepoint: U+1D362
      byte[] data = new byte[]

      { (byte)0xF0, (byte)0x9D, (byte)0x8D, (byte)0xA2 }

      ;

      //output is: ��
      // should be: 𝍢
      System.out.println("'" + StringEscapeUtils.escapeHtml(new String(data, "UTF8")) + "'");
      }
      }

      Should be very quick to fix, feel free to drop me an email if you want a patch.

      1. lang-480.patch
        1 kB
        Alexander Kjäll

        Activity

        Alexander Kjäll created issue -
        Hide
        Alexander Kjäll added a comment -

        Here is a fix for the problem i think

        Show
        Alexander Kjäll added a comment - Here is a fix for the problem i think
        Alexander Kjäll made changes -
        Field Original Value New Value
        Attachment lang-480.patch [ 12398381 ]
        Hide
        Alexander Kjäll added a comment -

        And of course you shouldn't develop before you dring your coffe. Here is a working patch for the same problem.

        Show
        Alexander Kjäll added a comment - And of course you shouldn't develop before you dring your coffe. Here is a working patch for the same problem.
        Alexander Kjäll made changes -
        Attachment lang-480.patch [ 12398388 ]
        Alexander Kjäll made changes -
        Attachment lang-480.patch [ 12398381 ]
        Hide
        Joerg Schaible added a comment -

        Unfortunately this patch will have to wait until minimum requirement for commons-lang is JDK 5. Currently commons-lang is still compatible to JDK 1.2. However, talk for a JDK 5 version has already started.

        Show
        Joerg Schaible added a comment - Unfortunately this patch will have to wait until minimum requirement for commons-lang is JDK 5. Currently commons-lang is still compatible to JDK 1.2. However, talk for a JDK 5 version has already started.
        Hide
        Alexander Kjäll added a comment -

        That is a bit sad.

        How likely do you think that the JDK 5 version to be, will it happen within this quarter?

        I guess i could try to write a patch that is compatible with java 1.2, but that would require me to do my own parsing of the format that java stores characters in memory, so i would really like to avoid having that code in a library.

        Show
        Alexander Kjäll added a comment - That is a bit sad. How likely do you think that the JDK 5 version to be, will it happen within this quarter? I guess i could try to write a patch that is compatible with java 1.2, but that would require me to do my own parsing of the format that java stores characters in memory, so i would really like to avoid having that code in a library.
        Hide
        Sebb added a comment -

        I've not looked at the code, so this may be nonsense -

        Perhaps you could make the processing conditional - if it finds it's running under JVM 1.5+, then use the JVM Method, otherwise ignore the problem?

        Show
        Sebb added a comment - I've not looked at the code, so this may be nonsense - Perhaps you could make the processing conditional - if it finds it's running under JVM 1.5+, then use the JVM Method, otherwise ignore the problem?
        Hide
        James Carman added a comment -

        Wouldn't you have to use reflection, then?

        Show
        James Carman added a comment - Wouldn't you have to use reflection, then?
        Hide
        Sebb added a comment -

        Yes, but AFAICT Class.getMethod() is available in Java 1.2.

        The method could be fetched in a static block.

        Show
        Sebb added a comment - Yes, but AFAICT Class.getMethod() is available in Java 1.2. The method could be fetched in a static block.
        Hide
        James Carman added a comment -

        Of course it is. My point was that we would be engaging in reflection nastiness and it might not be worth it. I would suggest that if Alexander needs a release sooner that they do an "internal" release from the trunk with the changes applied and then "upgrade" when we get a newer release out. I don't like the idea of building in the reflection stuff. We get no compiler checking that way and it leads to unreadable code.

        Show
        James Carman added a comment - Of course it is. My point was that we would be engaging in reflection nastiness and it might not be worth it. I would suggest that if Alexander needs a release sooner that they do an "internal" release from the trunk with the changes applied and then "upgrade" when we get a newer release out. I don't like the idea of building in the reflection stuff. We get no compiler checking that way and it leads to unreadable code.
        Hide
        Alexander Kjäll added a comment -

        Just my 2 cents, I don't need a release that fixes this bug, i stumbled on it by chance and wrote a patch so that the next person that have the same problem that i do won't have to dig through the library in order to understand what's going on.

        I'm mainly interested in fixing this because i don't like buggy software, but i totally agree that building in reflection stuff leads to more problems than it solves in the long run.

        My opinion on how to fix this is either push for the JDK 1.5 dependency, or write some code that parses the format the strings are stored in memory. The latter might sound complicated but i think it's quite straight forward.

        Show
        Alexander Kjäll added a comment - Just my 2 cents, I don't need a release that fixes this bug, i stumbled on it by chance and wrote a patch so that the next person that have the same problem that i do won't have to dig through the library in order to understand what's going on. I'm mainly interested in fixing this because i don't like buggy software, but i totally agree that building in reflection stuff leads to more problems than it solves in the long run. My opinion on how to fix this is either push for the JDK 1.5 dependency, or write some code that parses the format the strings are stored in memory. The latter might sound complicated but i think it's quite straight forward.
        Henri Yandell made changes -
        Fix Version/s 3.0 [ 12311714 ]
        Henri Yandell made changes -
        Remaining Estimate 4h [ 14400 ]
        Original Estimate 4h [ 14400 ]
        Henri Yandell made changes -
        Description Characters that are represented as a 2 characters internaly by java are incorrectly converted by the function. The following test displays the problem quite nicely:

        import org.apache.commons.lang.*;

        public class J2 {
            public static void main(String[] args) throws Exception {
                // this is the utf8 representation of the character:
                // COUNTING ROD UNIT DIGIT THREE
                // in unicode
                // codepoint: U+1D362
                byte[] data = new byte[] { (byte)0xF0, (byte)0x9D, (byte)0x8D, (byte)0xA2 };

                //output is: ��
                // should be: 𝍢
                System.out.println("'" + StringEscapeUtils.escapeHtml(new String(data, "UTF8")) + "'");
            }
        }

        Should be very quick to fix, feel free to drop me an email if you want a patch.
        Characters that are represented as a 2 characters internaly by java are incorrectly converted by the function. The following test displays the problem quite nicely:

        import org.apache.commons.lang.*;

        public class J2 {
            public static void main(String[] args) throws Exception {
                // this is the utf8 representation of the character:
                // COUNTING ROD UNIT DIGIT THREE
                // in unicode
                // codepoint: U+1D362
                byte[] data = new byte[] { (byte)0xF0, (byte)0x9D, (byte)0x8D, (byte)0xA2 };

                //output is: ��
                // should be: 𝍢
                System.out.println("'" + StringEscapeUtils.escapeHtml(new String(data, "UTF8")) + "'");
            }
        }

        Should be very quick to fix, feel free to drop me an email if you want a patch.
        Hide
        Henri Yandell added a comment -

        svn ci -m "Applying Alexander Kjall's patch from LANG-480; along with a unit test made from his example. Fixes unicode conversion above U+00FFFF being done into 2 characters"

        Sending src/java/org/apache/commons/lang/Entities.java
        Sending src/test/org/apache/commons/lang/StringEscapeUtilsTest.java
        Transmitting file data ..
        Committed revision 749095.

        Show
        Henri Yandell added a comment - svn ci -m "Applying Alexander Kjall's patch from LANG-480 ; along with a unit test made from his example. Fixes unicode conversion above U+00FFFF being done into 2 characters" Sending src/java/org/apache/commons/lang/Entities.java Sending src/test/org/apache/commons/lang/StringEscapeUtilsTest.java Transmitting file data .. Committed revision 749095.
        Henri Yandell made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Fixed [ 1 ]
        Henri Yandell made changes -
        Component/s lang.* [ 12313203 ]
        Mark Thomas made changes -
        Workflow jira [ 12450668 ] Default workflow, editable Closed status [ 12602321 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Alexander Kjäll
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development