Details
Improvement
Status: Resolved
Major
Resolution: Fixed
None
JDK: openjdk-8-jdk Version 8u242-b08-0ubuntu3~18.04 on Ubuntu 18.04 amd64
The ICU4J library was used to process Unicode correctly; see the dependencies in the POM.
Patch Available
Moderate
Patch
Description
Hi,
AFAIK all versions of Camel are affected by the following bug: Camel counts the chars in the fixed-length data format incorrectly.
Unicode is a bit tricky when it comes to counting the length of a string, especially since Java uses UTF-16 internally, which means that a code point takes 1 or 2 (Java) chars. Bindy seems to use substring internally for field selection and counts chars the way Java does. This means the length of a string is the count of chars, i.e. UTF-16 code units (a surrogate pair counts as two), not code points, which are the common denominator (e.g. see the definition of string length in XML Schema). And when one takes combining chars into account (one "base char" plus 0 to n combining chars is perceived as one "char" by users), it becomes even more of a problem.
The fixed-length data format depends entirely on counting chars correctly. If the chars are miscounted, it becomes unusable, since it cannot recover for the "columns" to the right.
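To illustrate the three ways of counting, here is a minimal sketch using only the JDK. The class and method names are made up for this example and are not part of Camel or Bindy. The grapheme count uses java.text.BreakIterator, which handles a simple base char plus combining mark but is not a substitute for ICU4J on modern emoji sequences.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical demo class, not Camel code.
final class CharCounting {

    // Number of UTF-16 code units: what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of user-perceived characters according to the JDK's
    // character BreakIterator (an approximation on older JDKs).
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (a surrogate pair in UTF-16), 'e' + U+0301 combining acute
        String s = "a\uD83D\uDE00e\u0301";
        System.out.println(codeUnits(s));  // 5
        System.out.println(codePoints(s)); // 4
        System.out.println(graphemes(s));  // 3
    }
}
```

A user would say this string has three characters, while a substring-based implementation sees five.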
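The effect on the columns can be sketched as follows. This is not Bindy's actual code; the helper names and the record layout (a 3-wide name column followed by a 2-wide code column) are assumptions made for illustration. It compares slicing by Java chars with slicing by code points via String.offsetByCodePoints. Combining chars would still need grapheme-aware handling on top of this.

```java
// Hypothetical demo class, not Camel code.
final class FixedLengthSlicing {

    // Naive slicing: start and width are measured in UTF-16 code units.
    static String byChars(String record, int start, int width) {
        return record.substring(start, start + width);
    }

    // Slicing where start and width are measured in code points.
    static String byCodePoints(String record, int start, int width) {
        int from = record.offsetByCodePoints(0, start);
        int to = record.offsetByCodePoints(from, width);
        return record.substring(from, to);
    }

    public static void main(String[] args) {
        // Name column "J", U+1F600, "e" (3 code points, 4 chars), then code column "42".
        String record = "J\uD83D\uDE00e42";

        // The emoji takes two chars, so the second column starts one char too early.
        System.out.println(byChars(record, 3, 2));      // e4
        System.out.println(byCodePoints(record, 3, 2)); // 42
    }
}
```

A single supplementary character in the first column is enough to shift every column to its right.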
See also the mailing list at http://mail-archives.apache.org/mod_mbox/camel-users/202001.mbox/browser
As suggested, I created a pull request, since this may be of some interest to the community. The ICU4J lib was used to process Unicode correctly, since the functionality built into the Java API is too old to process modern emojis (skin colour, hair, sex) correctly. Please watch the license...
Pull-request: https://github.com/apache/camel/pull/3552