Created attachment 24069 [details]
classes for hyphenation, generated from UnicodeData.txt

The TeX people are now moving to Unicode based TeX engines. Therefore they created new hyphenation pattern files in utf-8 encoding, see http://www.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/ and http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/.

These pattern files can be directly transformed into XML format and used in FOP. I tested a few, and had no problems.

They lack one thing, however: classes. FOP uses classes to determine what is a letter (only words consisting of letters will be hyphenated) and the LC/UC mapping. TeX gets the classes from its Unicode setup, see e.g. http://scripts.sil.org/svn-public/xetex/TRUNK/texmf/tex/generic/xetex/unicode-letters.tex.

I have tried to do the same, and I attach the result. These classes would be valid for each hyphenation pattern file. Some localizations seem to have their own variants of the LC/UC mapping, but I have not investigated that.

The classes were generated as follows: Roughly, each character that is its own LC generates a class. Its UC and TC (title case character) are added to the class.

More precisely, the selection of characters generating a class was done as follows:

1. In the first plane,
2. Category Ll or Lu or Lt and its own LC character, or category Lo,
3. Not in the following blocks: Superscripts and Subscripts, Letterlike Symbols, Alphabetic Presentation Forms, Halfwidth and Fullwidth Forms, CJK Unified Ideographs, CJK Unified Ideographs Extension A, Hangul Syllables.

We can do two things: Add these classes to each hyphenation file, or add them to the code that generates the hyphenation trie, preferably to be read from a separate file. I prefer the latter option. What do you think?
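For illustration, the selection rules above can be sketched in a few lines of Python. This is only a sketch, not the script that produced the attachment: the names `EXCLUDED_BLOCKS`, `generate_classes` and `SAMPLE` are made up for this example, the block ranges are the ones from the Unicode block list, and the sample records are abridged lines in the UnicodeData.txt field layout (field 2 is the general category, fields 12, 13 and 14 are the simple upper, lower and title case mappings).

```python
# Blocks excluded by rule 3, as (first, last) code point ranges.
EXCLUDED_BLOCKS = [
    (0x2070, 0x209F),  # Superscripts and Subscripts
    (0x2100, 0x214F),  # Letterlike Symbols
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xFB00, 0xFB4F),  # Alphabetic Presentation Forms
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]


def generate_classes(lines):
    """Return one class string per selected character.

    Each class starts with the generating character, followed by its
    upper case and title case characters where these differ.
    """
    classes = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15:
            continue  # not a complete UnicodeData.txt record
        code = int(fields[0], 16)
        category = fields[2]
        upper, lower, title = fields[12], fields[13], fields[14]

        # Rule 1: first plane only.
        if code > 0xFFFF:
            continue
        # Rule 3: skip the excluded blocks.
        if any(first <= code <= last for first, last in EXCLUDED_BLOCKS):
            continue
        # Rule 2: a cased letter that is its own LC, or an Lo letter.
        # An empty lower case field means the character maps to itself.
        own_lower = lower == "" or int(lower, 16) == code
        if not (category == "Lo" or (category in ("Ll", "Lu", "Lt") and own_lower)):
            continue

        members = [code]
        for mapped in (upper, title):
            if mapped and int(mapped, 16) not in members:
                members.append(int(mapped, 16))
        classes.append("".join(chr(cp) for cp in members))
    return classes


# Abridged sample records (decomposition and old-name fields left empty).
SAMPLE = [
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "01C4;LATIN CAPITAL LETTER DZ WITH CARON;Lu;0;L;;;;;N;;;;01C6;01C5",
    "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;;;;;N;;;01C4;01C6;01C5",
    "01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;;;;;N;;;01C4;;01C5",
    "05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;",
    "2102;DOUBLE-STRUCK CAPITAL C;Lu;0;L;;;;;N;;;;;",
]

if __name__ == "__main__":
    for cls in generate_classes(SAMPLE):
        print(" ".join(f"U+{ord(ch):04X}" for ch in cls))
```

On the sample, only U+0061, U+01C6 and U+05D0 generate a class; U+01C6 picks up both its UC (U+01C4) and its TC (U+01C5). The First/Last range records that UnicodeData.txt uses for CJK and Hangul need no special handling here, since they fall inside the excluded blocks anyway.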
Of course, I can also XInclude the classes into the pattern files. The question is if the classes are more a property of FOP or of the hyphenation pattern files.
Hi Simon,

(In reply to comment #0)
> Created an attachment (id=24069) [details]
> classes for hyphenation, generated from UnicodeData.txt
<snip/>
> We can do two things: Add these classes to each hyphenation file, or add them
> to the code that generates the hyphenation trie, preferably to be read from a
> separate file. I prefer the latter option. What do you think?

Not that I know much about hyphenation, but the latter option also looks preferable to me. I'd say that the classes belong in FOP.

Vincent
Fixed in revision 805561 and some later revisions, finally in revision 810896
batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed