Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
4.5.7, 5.0 Beta3
None
Description
Unicode handling in header values is badly broken, as the examples below show (expected value on the left, what is actually written on the right):
httpget.addHeader("X-I-Expect-This-Header", "Федор Достоевский") => X-I-Expect-This-Header: $54>@ >AB>52A:89
httpget.addHeader("X-I-Expect-This-Header", "宮本茂") => X-I-Expect-This-Header: �,
httpget.addHeader("X-I-Expect-This-Header", "Ἀριστοτέλης") => X-I-Expect-This-Header:���Ŀĭ���
The root cause is here:
for (int i1 = off, i2 = oldlen; i2 < newlen; i1++, i2++) {
    this.array[i2] = (byte) b[i1];
}
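In this code, b is of type char[] and array is of type byte[]. According to JLS § 5.1.3 ("Narrowing Primitive Conversion"), "[a] narrowing conversion of a char to an integral type T likewise simply discards all but the n lowest order bits, where n is the number of bits used to represent type T." Each 16-bit char is therefore silently truncated to its low 8 bits. For example, 'Ф' (U+0424) becomes 0x24, which is '$', the first character of the corrupted Cyrillic output above.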
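The truncation can be reproduced outside the client. The following is a minimal sketch, not the library's code: the class name NarrowingDemo and the method narrow are hypothetical, and the loop only imitates the cast shown above. Unicode escapes are used so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class NarrowingDemo {
    // Hypothetical helper imitating the buggy loop: each char is cast
    // to byte, which keeps only the low 8 bits (JLS 5.1.3).
    static byte[] narrow(char[] b) {
        byte[] out = new byte[b.length];
        for (int i = 0; i < b.length; i++) {
            out[i] = (byte) b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // "\u0424\u0435\u0434\u043e\u0440" is the Cyrillic word "Федор".
        byte[] wire = narrow("\u0424\u0435\u0434\u043e\u0440".toCharArray());
        // U+0424 -> 0x24 '$', U+0435 -> 0x35 '5', U+0434 -> 0x34 '4',
        // U+043E -> 0x3E '>', U+0440 -> 0x40 '@'
        System.out.println(new String(wire, StandardCharsets.US_ASCII)); // prints "$54>@"
    }
}
```

The printed prefix matches the start of the corrupted header value in the first example.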
There are a few ways we could fix this, and any of them would be better than what we are doing now. The two I'll propose for consideration are:
- Just write UTF-8 to the wire; non-ASCII characters should be tolerated as obs-text
- Replace non-ASCII characters with an empty string, space, or question mark
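To make the two options concrete, here is a rough sketch of what each could look like. The class HeaderEncodingSketch and both method names are hypothetical and are not part of HttpClient; the question mark is just one of the replacement choices listed above.

```java
import java.nio.charset.StandardCharsets;

public class HeaderEncodingSketch {
    // Option 1 (assumed shape): write the value to the wire as UTF-8 bytes.
    static byte[] encodeUtf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Option 2 (assumed shape): replace every non-ASCII char with '?'.
    // A supplementary character (surrogate pair) yields two '?' here.
    static byte[] encodeAsciiLossy(String value) {
        byte[] out = new byte[value.length()];
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            out[i] = (byte) (c < 0x80 ? c : '?');
        }
        return out;
    }

    public static void main(String[] args) {
        // "\u5bae\u672c\u8302" is the Japanese name from the second example.
        String name = "\u5bae\u672c\u8302";
        System.out.println(encodeUtf8(name).length); // prints 9 (three 3-byte sequences)
        System.out.println(new String(encodeAsciiLossy(name), StandardCharsets.US_ASCII)); // prints "???"
    }
}
```

Option 1 preserves the value for any receiver that decodes the header as UTF-8. Option 2 loses information, but what reaches the wire is at least predictable.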
See also: https://issues.apache.org/jira/browse/HTTPCLIENT-1974