The PatternParser.characters method, which parses patterns and other content from hyphenation XML files, cannot cope with the XML parser splitting text into multiple character events. As a result, a pattern that crosses a buffer boundary may be parsed as two wrong patterns.

Furthermore, the implementation is quite inefficient:
- copying the characters into a StringBuffer is unnecessary
- the tokenizer moves the whole array

The readToken method is declared to return a string, but it always returns null and stores the token in a class variable (horrible design).
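To illustrate the split-event problem, here is a minimal sketch of a tokenizer that survives a token being cut across two character events. It is not the existing PatternParser code: the class name SplitSafeTokenizer and its methods are hypothetical. It assumes tokens are separated by whitespace, scans the parser's array in place, and copies only the characters that belong to a token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of FOP: whitespace tokenizer that tolerates split events. */
class SplitSafeTokenizer {
    // Holds only the characters of the token currently being assembled.
    private final StringBuilder pending = new StringBuilder();
    private final List<String> tokens = new ArrayList<>();

    /** Same signature shape as a SAX characters() callback; may be called repeatedly. */
    void characters(char[] ch, int start, int length) {
        int end = start + length;
        int tokenStart = -1; // index where the current token began in this event, or -1
        for (int i = start; i < end; i++) {
            if (Character.isWhitespace(ch[i])) {
                if (tokenStart >= 0) {
                    pending.append(ch, tokenStart, i - tokenStart);
                    tokenStart = -1;
                }
                flush(); // whitespace ends a token, even one begun in an earlier event
            } else if (tokenStart < 0) {
                tokenStart = i;
            }
        }
        if (tokenStart >= 0) {
            // The event ended in the middle of a token: keep it for the next event.
            pending.append(ch, tokenStart, end - tokenStart);
        }
    }

    /** Call when the enclosing element ends, so that a trailing token is not lost. */
    void endOfText() {
        flush();
    }

    private void flush() {
        if (pending.length() > 0) {
            tokens.add(pending.toString());
            pending.setLength(0);
        }
    }

    List<String> tokens() {
        return tokens;
    }
}
```

In a real handler, characters would presumably be forwarded from the SAX characters callback and endOfText from endElement.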
Umm, delete the comment about readToken always returning null. The design is still somewhat horrible.
Further points:
- Using a ternary tree for the charclass arrays seems wasteful. Two parallel arrays and an array binary search should be sufficient.
- The charclass parser won't check whether a . (dot) represents a class. The dot is reserved as the begin/end-of-word marker in patterns; using it as a class representation will probably cause problems.
- The pattern parser won't check whether the non-digits in the patterns are actually charclass representations.
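A rough sketch of what the three points above could look like together. The class CharClassTable and its methods are hypothetical, not existing FOP code. It assumes that each class is given as a string whose first character is the class representation (for example "aA"), and that a dot is acceptable anywhere in a pattern; a stricter check would allow it only at the start or end.

```java
import java.util.Arrays;

/** Hypothetical replacement for the ternary tree: two parallel arrays plus binary search. */
final class CharClassTable {
    private final char[] members;         // all class members, sorted ascending
    private final char[] representatives; // representatives[i] is the class char of members[i]

    CharClassTable(String... classes) {
        int n = 0;
        for (String c : classes) {
            n += c.length();
        }
        long[] packed = new long[n];
        int k = 0;
        for (String c : classes) {
            if (c.isEmpty()) {
                continue;
            }
            char rep = c.charAt(0); // assumption: first char represents the class
            if (rep == '.') {
                throw new IllegalArgumentException("'.' is reserved as the word boundary marker");
            }
            for (int i = 0; i < c.length(); i++) {
                // Pack member (high bits) and representative (low bits) so one sort orders both.
                packed[k++] = ((long) c.charAt(i) << 16) | rep;
            }
        }
        Arrays.sort(packed);
        members = new char[n];
        representatives = new char[n];
        for (int i = 0; i < n; i++) {
            members[i] = (char) (packed[i] >>> 16);
            representatives[i] = (char) (packed[i] & 0xFFFF);
        }
    }

    /** Returns the class representation of c, or -1 if c belongs to no class. */
    int representativeOf(char c) {
        int i = Arrays.binarySearch(members, c);
        return i >= 0 ? representatives[i] : -1;
    }

    /** True if every char that is neither an ASCII digit nor a dot is a class representation. */
    boolean isValidPattern(String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if ((c >= '0' && c <= '9') || c == '.') {
                continue;
            }
            if (representativeOf(c) != c) {
                return false;
            }
        }
        return true;
    }
}
```

With classes "aA" and "bB", the pattern ".a1b" would pass this check while "a1B" would not, because B is a class member but not the representation of its class. The sketch does not detect a character that appears in two classes.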
resetting P2 open bugs to P3 pending further review