Santuario / SANTUARIO-307

utf8 encode is broken


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Java 1.4.4
    • Fix Version/s: Java 1.5.7, Java 2.0.1
    • Component/s: Java
    • Security Level: Public (Public issues, viewable by everyone)
    • Environment:
      Ubuntu 10.04 LTS 32 bit
      JRE 5.x

      Description

      This code from UtfHelpper.writeCharToUtf8 (with similar checks in other UtfHelpper methods):
      if ((c >= 0xD800 && c <= 0xDBFF) || (c >= 0xDC00 && c <= 0xDFFF)) {
          // No Surrogates in sun java
          out.write(0x3f);
          return;
      }
      appears to exclude these three Unicode blocks:
      http://www.fileformat.info/info/unicode/block/high_surrogates/index.htm
      http://www.fileformat.info/info/unicode/block/high_private_use_surrogates/index.htm
      http://www.fileformat.info/info/unicode/block/low_surrogates/index.htm
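
As a rough illustration of how the range check above maps onto those three blocks, the following sketch uses the JDK's `Character.UnicodeBlock` lookup. The class name `SurrogateBlocks` and its helper methods are hypothetical and exist only for this example; `isFiltered` simply repeats the condition quoted from `UtfHelpper`.

```java
public class SurrogateBlocks {
    // Same condition as the check quoted from UtfHelpper.writeCharToUtf8.
    static boolean isFiltered(char c) {
        return (c >= 0xD800 && c <= 0xDBFF) || (c >= 0xDC00 && c <= 0xDFFF);
    }

    // Looks up the Unicode block that a single UTF-16 code unit belongs to.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        // One sample code unit from each of the three blocks.
        char[] samples = { 0xD840, 0xDB80, 0xDC0B };
        for (char c : samples) {
            System.out.printf("0x%04X block=%s filtered=%b%n",
                    (int) c, blockOf(c), isFiltered(c));
        }
    }
}
```

Every code unit in these blocks is replaced with `?` by the check, whether or not it is half of a valid surrogate pair.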

      The problem is that characters outside the Basic Multilingual Plane (code points above U+FFFF) are represented in UTF-16 as surrogate pairs, and the code units of those pairs fall in that range. For example, the character U+2000B is encoded in UTF-16 as 0xD840 0xDC0B.
      http://www.fileformat.info/info/unicode/char/2000b/index.htm
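
The surrogate-pair arithmetic for this character can be checked with a small sketch. The class `SurrogateDemo` and the method `toSurrogates` are hypothetical names used only for illustration; the result is cross-checked against the JDK's `Character.toChars`.

```java
import java.util.Arrays;

public class SurrogateDemo {
    // Splits a supplementary code point (above U+FFFF) into its
    // UTF-16 high and low surrogate code units.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;               // 20-bit offset
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x2000B);
        System.out.printf("U+2000B -> 0x%04X 0x%04X%n", (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(0x2000B)));
    }
}
```

Both halves of the pair land inside the range rejected by the check in `UtfHelpper`, which is why the character is lost.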

      As a result, each surrogate code unit of such a character is written as ? (0x3F), which corrupts the output.

      Here is a code sample:
      Canonicalizer c14n = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_EXCL_OMIT_COMMENTS);

      String doc1 = "<a>\u4E1F</a>";
      String doc2 = "<a>\uD840\uDC0B</a>";

      System.out.println("doc1 before:" + doc1);
      byte [] output = c14n.canonicalize(doc1.getBytes("UTF8"));
      System.out.println("doc1 after:" + new String(output, "UTF8"));

      System.out.println("doc2 before:" + doc2);
      output = c14n.canonicalize(doc2.getBytes("UTF8"));
      System.out.println("doc2 after:" + new String(output, "UTF8"));

      The output is:
      doc1 before:<a>丟</a>
      doc1 after:<a>丟</a>
      doc2 before:<a>𠀋</a>
      doc2 after:<a>??</a>

      Notice that "doc2 after" is corrupted: it shows <a>??</a> instead of <a>𠀋</a>.

      Based on the code, there does not seem to be any workaround.
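
For comparison, here is a minimal sketch of an encoder that combines valid surrogate pairs into a single code point before writing UTF-8, and falls back to `?` only for unpaired surrogates. This is an assumption about one possible approach, not the code from the attached patches or from the released fix; `Utf8Sketch` and `encode` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

public class Utf8Sketch {
    // Hypothetical helper, not the library's code: encodes a String to UTF-8,
    // joining valid surrogate pairs and writing '?' for unpaired surrogates.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            int cp = c;
            if (Character.isHighSurrogate(c) && i < s.length()
                    && Character.isLowSurrogate(s.charAt(i))) {
                // Valid pair: consume the low surrogate and combine.
                cp = Character.toCodePoint(c, s.charAt(i++));
            } else if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                // Unpaired surrogate: cannot be represented in UTF-8.
                out.write(0x3F);
                continue;
            }
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                // Supplementary code point: four-byte sequence.
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Prints the UTF-8 bytes of U+2000B as hex.
        for (byte b : encode("\uD840\uDC0B")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

With this approach, U+2000B would be written as the four bytes F0 A0 80 8B, which matches what the JDK's own UTF-8 encoder produces for the same string.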

        Attachments

        1. UtfHelpper.java
          8 kB
          TH Heung
        2. CanonicalizerBase.java
          31 kB
          TH Heung
        3. CanonicalizerBase.java
          28 kB
          TH Heung
        4. UtfHelperTest.java
          3 kB
          TH Heung
        5. UtfHelpper.java
          8 kB
          TH Heung
        6. UtfHelperTest.java
          4 kB
          TH Heung
        7. UtfHelpper.java
          9 kB
          TH Heung

            People

            • Assignee:
              coheigea Colm O hEigeartaigh
              Reporter:
              xmlsec_user steve
            • Votes:
              2
              Watchers:
              4
