[SANTUARIO-307] utf8 encode is broken - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: Java 1.4.4
Fix Version/s: Java 1.5.7, Java 2.0.1
Component/s: Java
Security Level: Public (Public issues, viewable by everyone)
Labels:
Environment:
Ubuntu 10.04 LTS 32 bit
JRE 5.x

Description

This code from UtfHelpper.writeCharToUtf8 (similar checks appear in other UtfHelpper methods):
    if ((c >= 0xD800 && c <= 0xDBFF) || (c >= 0xDC00 && c <= 0xDFFF) ){
        //No Surrogates in sun java
        out.write(0x3f);
        return;
    }
seems to exclude these 3 Unicode blocks, writing ? (0x3f) in place of any char that falls in them:
http://www.fileformat.info/info/unicode/block/high_surrogates/index.htm
http://www.fileformat.info/info/unicode/block/high_private_use_surrogates/index.htm
http://www.fileformat.info/info/unicode/block/low_surrogates/index.htm

The problem is that characters outside the Basic Multilingual Plane are represented in UTF-16 as surrogate pairs, and the two code units of each pair fall in exactly that range. For example, the character U+2000B encoded as UTF-16 is 0xD840 0xDC0B.
http://www.fileformat.info/info/unicode/char/2000b/index.htm
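The arithmetic behind that example can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names (SurrogatePairDemo, toSurrogatePair) are made up for this note and are not part of the library.

```java
final class SurrogatePairDemo {

    /**
     * Splits a supplementary code point (U+10000..U+10FFFF) into the
     * UTF-16 high/low surrogate pair that represents it.
     * Hypothetical helper, written only to illustrate the arithmetic.
     */
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x2000B);
        System.out.printf("U+2000B -> 0x%04X 0x%04X%n", (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion.
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x2000B)));
    }
}
```

Both resulting code units lie inside the ranges tested by the check quoted above, which is why each of them is replaced individually.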

As a result, each code unit of such a character is written as ?, which corrupts the output.

Here is a code sample:
// The library must be initialized before a Canonicalizer can be obtained.
org.apache.xml.security.Init.init();
Canonicalizer c14n = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_EXCL_OMIT_COMMENTS);

// doc1 holds a BMP character (U+4E1F); doc2 holds U+2000B as a surrogate pair.
String doc1 = "<a>\u4E1F</a>";
String doc2 = "<a>\uD840\uDC0B</a>";

System.out.println("doc1 before:" + doc1);
byte[] output = c14n.canonicalize(doc1.getBytes("UTF8"));
System.out.println("doc1 after:" + new String(output, "UTF8"));

System.out.println("doc2 before:" + doc2);
output = c14n.canonicalize(doc2.getBytes("UTF8"));
System.out.println("doc2 after:" + new String(output, "UTF8"));

The output is:
doc1 before:<a>丟</a>
doc1 after:<a>丟</a>
doc2 before:<a>𠀋</a>
doc2 after:<a>??</a>

Notice that "doc2 after" is corrupted: it shows <a>??</a> instead of <a>𠀋</a>.

Based on the code, there does not seem to be any workaround.
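For illustration, here is a minimal sketch of how a surrogate-aware UTF-8 encoder could look. It is not the fix that was committed for this issue, and the class and method names (Utf8SurrogateSketch, encode) are hypothetical. It assumes that a valid high/low pair should be combined into one code point and written as four bytes, while an unpaired surrogate is still replaced by 0x3f.

```java
import java.io.ByteArrayOutputStream;

final class Utf8SurrogateSketch {

    /** Hypothetical helper: encodes a String as UTF-8, combining valid surrogate pairs. */
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int cp = c;
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid pair: merge both code units and skip the low surrogate.
                cp = Character.toCodePoint(c, s.charAt(++i));
            } else if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                // Unpaired surrogate: no valid encoding exists, so emit '?'.
                out.write(0x3F);
                continue;
            }
            if (cp < 0x80) {                       // 1 byte
                out.write(cp);
            } else if (cp < 0x800) {               // 2 bytes
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             // 3 bytes
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                               // 4 bytes (supplementary planes)
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Print the UTF-8 bytes of U+2000B in hex.
        StringBuilder hex = new StringBuilder();
        for (byte b : encode("\uD840\uDC0B")) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

With this approach U+2000B is written as the four bytes F0 A0 80 8B, the same bytes the JDK produces for that string with its built-in UTF-8 charset.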

Attachments

UtfHelpper.java          30/May/14 04:20    8 kB   TH Heung
UtfHelpper.java          30/May/14 04:22    8 kB   TH Heung
UtfHelpper.java          18/Jun/14 09:22    9 kB   TH Heung
UtfHelperTest.java       30/May/14 04:22    3 kB   TH Heung
UtfHelperTest.java       18/Jun/14 09:22    4 kB   TH Heung
CanonicalizerBase.java   30/May/14 04:22   31 kB   TH Heung
CanonicalizerBase.java   30/May/14 04:22   28 kB   TH Heung

People

Assignee: Colm O hEigeartaigh
Reporter: steve
Votes: 2
Watchers: 4

Dates

Created: 26/Mar/12 17:41
Updated: 01/Jul/14 12:10
Resolved: 18/Jun/14 10:23