Harmony / HARMONY-137

CharsetDecoder should replace undefined bytes with replacement string

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Classlib
    • Labels:
      None

      Description

      According to the cp1250 mapping table, byte 0x81 is undefined. See http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
      The charset decoder should therefore replace undefined bytes with the default replacement character, i.e. U+FFFD.
      Test case to reproduce the issue:

      import java.nio.charset.*;
      import java.nio.*;

      public class Harmony137 {
          public static void main(String[] args) throws Exception {
              ByteBuffer bb = ByteBuffer.allocate(5);
              bb.put((byte) 0x81);
              bb.flip();
              Charset cp1250 = Charset.forName("cp1250");
              CharBuffer cb = cp1250.newDecoder()
                      .onMalformedInput(CodingErrorAction.REPLACE)
                      .onUnmappableCharacter(CodingErrorAction.REPLACE)
                      .decode(bb);
              if (cb.get(0) != 65533) { // 65533 == 0xFFFD
                  System.out.println("FAIL: expected 0xFFFD but result is: 0x"
                          + Integer.toHexString(cb.get(0)).toUpperCase());
              }
          }
      }
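The test case sets both error actions to REPLACE, so it cannot show whether a given decoder treats byte 0x81 as malformed, unmappable, or as an ordinary mapped byte. A minimal sketch for probing that, switching both actions to REPORT and inspecting the outcome, might look like the following. The class name `Cp1250Probe` and the helper `classify` are hypothetical names chosen for illustration, and the result for 0x81 depends on the charset provider in use, so no expected output is given for that byte.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.UnmappableCharacterException;

public class Cp1250Probe {

    // Hypothetical helper: decodes a single byte with both error actions set
    // to REPORT and describes how the decoder handled it.
    static String classify(Charset cs, byte b) {
        try {
            CharBuffer cb = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(new byte[] { b }));
            if (!cb.hasRemaining()) {
                return "no output";
            }
            return String.format("mapped to U+%04X", (int) cb.get(0));
        } catch (MalformedInputException e) {
            return "malformed";
        } catch (UnmappableCharacterException e) {
            return "unmappable";
        } catch (CharacterCodingException e) {
            return "other coding error";
        }
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("cp1250");
        // Provider-dependent: may report an error or a mapping such as U+0081.
        System.out.println("0x81: " + classify(cp1250, (byte) 0x81));
        // 0x41 is 'A' in cp1250, so this byte is always mapped.
        System.out.println("0x41: " + classify(cp1250, (byte) 0x41));
    }
}
```

Running this against two providers side by side makes it easy to see whether they disagree on the mapping itself or only on how the error is categorised.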

        Activity

        Paulex Yang added a comment -

        After some investigation of the Harmony code, this seems to be caused by a problem in the ICU4JNI decoder provider, as the following test case shows. The RI "cp1250" charset passes the test case, while the ICU "cp1250" charset fails under both the RI and Harmony. I'll try to report the bug to ICU.

        Test case:

        public void testDecode_JIRA137() {
            ByteBuffer bb = ByteBuffer.allocate(5);
            bb.put((byte) 0x81);
            bb.flip();
            // Use ICU cp1250 charset
            CharsetProviderICU provider = new CharsetProviderICU();
            Charset cp1250 = provider.charsetForName("cp1250");
            // Uncomment code below to use RI charset
            // cp1250 = Charset.forName("cp1250");
            CharBuffer cb;
            try {
                cb = cp1250.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPLACE)
                        .onUnmappableCharacter(CodingErrorAction.REPLACE)
                        .decode(bb);
                assertEquals(0xFFFD, cb.get(0));
            } catch (CharacterCodingException e) {
                e.printStackTrace();
            }
        }

        Richard Liang added a comment -

        Please see the bug info in ICU bug system: http://bugs.icu-project.org/cgi-bin/icu-bugs?findid=5085&go=Go

        Here is the ICU team's response to this bug:

        You are expecting incorrect behavior from cp1250. Both Microsoft's conversion APIs and IBM mapping tables convert byte 81 to Unicode character 0081. This conversion behavior will not change. The tables on unicode.org may tell you about the official mappings, but there are other mappings that are commonly expected.

        More details about ICU charset conversion can be found on this page: http://icu.sourceforge.net/charts/charset/

        This charset conversion works as expected.
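Since the ICU behavior will not change, a caller that still wants U+FFFD for the undefined cp1250 bytes would have to handle it after decoding. The sketch below is one possible approach, not something proposed in this issue: it assumes that any C1 control character (U+0080 to U+009F) in the decoded text came from an undefined byte, which holds for cp1250 because the unicode.org table maps no byte into that range. The class name `C1Replacer` and the method `replaceC1Controls` are hypothetical.

```java
public class C1Replacer {

    // Hypothetical helper: replaces every C1 control character
    // (U+0080..U+009F) in already-decoded text with U+FFFD.
    static String replaceC1Controls(CharSequence decoded) {
        StringBuilder sb = new StringBuilder(decoded.length());
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            boolean isC1 = c >= 0x80 && c <= 0x9F;
            sb.append(isC1 ? '\uFFFD' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulates decoder output in which byte 0x81 became U+0081.
        String decoded = "A\u0081B";
        String cleaned = replaceC1Controls(decoded);
        System.out.println(Integer.toHexString(cleaned.charAt(1)).toUpperCase());
    }
}
```

This post-processing would not be safe for charsets that legitimately map bytes to C1 controls, such as ISO-8859-1.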

        Tim Ellison added a comment -

        For the reasons Richard and the ICU team give, this is being marked as won't fix.

        Vladimir Strigun added a comment -

        Tim, I agree with the resolution, please close it.

        Tim Ellison added a comment -

        Verified by Vladimir.


          People

          • Assignee: Unassigned
          • Reporter: Vladimir Strigun