Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version: Java 1.4.4
Security Level: Public (Public issues, viewable by everyone)
Environment: Ubuntu 10.04 LTS 32 bit, JRE 5.x
Description
This code:
    if ((c >= 0xD800 && c <= 0xDBFF) || (c >= 0xDC00 && c <= 0xDFFF)) {
        //No Surrogates in sun java
        out.write(0x3f);
        return;
    }
from UtfHelpper.writeCharToUtf8, along with similar code in other UtfHelpper methods, appears to exclude these three Unicode blocks:
http://www.fileformat.info/info/unicode/block/high_surrogates/index.htm
http://www.fileformat.info/info/unicode/block/high_private_use_surrogates/index.htm
http://www.fileformat.info/info/unicode/block/low_surrogates/index.htm
The problem is that characters outside the Basic Multilingual Plane are encoded in UTF-16 as surrogate pairs, so their code units fall in that range. For example, the character U+2000B is encoded in UTF-16 as 0xD840 0xDC0B.
http://www.fileformat.info/info/unicode/char/2000b/index.htm
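To make the arithmetic concrete, here is a small sketch of how a supplementary code point is split into its two UTF-16 code units. The class and method names (SurrogatePairDemo, toSurrogates) are made up for illustration and are not part of the library.

```java
class SurrogatePairDemo {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate code units.
    // Hypothetical helper, for illustration only.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x2000B);
        // prints "U+2000B -> 0xD840 0xDC0B"
        System.out.printf("U+2000B -> 0x%04X 0x%04X%n", (int) pair[0], (int) pair[1]);

        // Cross-check against the JDK's own conversion.
        char[] jdk = Character.toChars(0x2000B);
        System.out.println(jdk[0] == pair[0] && jdk[1] == pair[1]);
    }
}
```

Both code units land inside the ranges tested by the condition quoted above, which is why a per-char check cannot tell a valid pair from a stray surrogate.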
As a result, each code unit of such a pair is written as ? (0x3f), which corrupts the output whenever the document contains one of these characters.
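For comparison, here is a minimal sketch of a UTF-8 writer that combines a valid surrogate pair into one code point before encoding, and only falls back to ? for an unpaired surrogate. It is not the library's code or its eventual fix; the class and method names (SurrogateAwareUtf8, toUtf8) are assumptions made up for this example.

```java
import java.io.ByteArrayOutputStream;

class SurrogateAwareUtf8 {
    // Encodes a Java string (UTF-16) as UTF-8, joining surrogate pairs.
    // Hypothetical helper, not part of UtfHelpper.
    static byte[] toUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            int cp = c;
            if (Character.isHighSurrogate(c) && i < s.length()
                    && Character.isLowSurrogate(s.charAt(i))) {
                // Valid pair: combine both code units into one code point.
                cp = Character.toCodePoint(c, s.charAt(i++));
            } else if (c >= 0xD800 && c <= 0xDFFF) {
                // Unpaired surrogate: cannot be encoded, write '?'.
                out.write(0x3f);
                continue;
            }
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                // Supplementary character: four-byte sequence.
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : toUtf8("\uD840\uDC0B")) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // prints "F0 A0 80 8B"
        System.out.println(hex.toString().trim());
    }
}
```

The sketch works on a whole string because the pairing decision needs the next char, which a single-char method such as writeCharToUtf8 does not see.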
Here is a code sample that reproduces the problem:
Canonicalizer c14n = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_EXCL_OMIT_COMMENTS);
String doc1 = "<a>\u4E1F</a>";
String doc2 = "<a>\uD840\uDC0B</a>";
System.out.println("doc1 before:" + doc1);
byte [] output = c14n.canonicalize(doc1.getBytes("UTF8"));
System.out.println("doc1 after:" + new String(output, "UTF8"));
System.out.println("doc2 before:" + doc2);
output = c14n.canonicalize(doc2.getBytes("UTF8"));
System.out.println("doc2 after:" + new String(output, "UTF8"));
The output is:
doc1 before:<a>丟</a>
doc1 after:<a>丟</a>
doc2 before:<a>𠀋</a>
doc2 after:<a>??</a>
Notice that "doc2 after" is corrupted: it shows <a>??</a> instead of <a>𠀋</a>.
Based on the code, there does not seem to be any workaround.