Bug 47610 - New hyphenation patterns
Summary: New hyphenation patterns
Status: CLOSED FIXED
Alias: None
Product: Fop - Now in Jira
Classification: Unclassified
Component: general
Version: all
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: fop-dev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-30 12:58 UTC by Simon Pepping
Modified: 2012-04-01 07:02 UTC
Attachments
classes for hyphenation, generated from UnicodeData.txt (81.59 KB, text/plain)
2009-07-30 12:58 UTC, Simon Pepping

Description Simon Pepping 2009-07-30 12:58:53 UTC
Created attachment 24069
classes for hyphenation, generated from UnicodeData.txt

The TeX people are now moving to Unicode-based TeX engines. They have therefore created new hyphenation pattern files in UTF-8 encoding; see http://www.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/ and http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/. These pattern files can be transformed directly into XML format and used in FOP. I tested a few and had no problems.

They lack one thing, however: classes. FOP uses classes to determine what is a letter (only words consisting entirely of letters will be hyphenated) and to define the lowercase/uppercase (LC/UC) mapping. TeX gets the classes from its Unicode setup; see e.g. http://scripts.sil.org/svn-public/xetex/TRUNK/texmf/tex/generic/xetex/unicode-letters.tex. I have tried to do the same, and I attach the result. These classes would be valid for every hyphenation pattern file. Some localizations seem to have their own variants of the LC/UC mapping, but I have not investigated that.
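To illustrate the two roles that classes play, here is a minimal Java sketch. It is not FOP's actual code: the class name `ClassLookupSketch` and the method `normalize` are hypothetical, and it assumes each class is given as a string whose first character is the canonical (lowercase) form. A word is normalized only if every character belongs to some class; otherwise it is treated as not hyphenatable.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of FOP: shows how character classes can
// both identify letters and map case variants to one canonical form.
final class ClassLookupSketch {
    // Maps every member of a class to the first (canonical) member.
    private final Map<Integer, Integer> canonical = new HashMap<>();

    ClassLookupSketch(List<String> classes) {
        for (String cls : classes) {
            if (cls.isEmpty()) {
                continue;
            }
            int first = cls.codePointAt(0);
            cls.codePoints().forEach(cp -> canonical.put(cp, first));
        }
    }

    // Returns the canonical form of the word, or empty if any character
    // is not in a class (such a word would not be hyphenated).
    Optional<String> normalize(String word) {
        StringBuilder sb = new StringBuilder();
        for (int cp : word.codePoints().toArray()) {
            Integer mapped = canonical.get(cp);
            if (mapped == null) {
                return Optional.empty();
            }
            sb.appendCodePoint(mapped);
        }
        return Optional.of(sb.toString());
    }
}
```

With the classes `aA` and `bB`, the word `AbBa` would normalize to `abba`, while `a1` would be rejected because `1` is in no class.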

The classes were generated as follows. Roughly, each character that is its own lowercase (LC) form generates a class, and its uppercase (UC) and title case (TC) characters are added to that class. More precisely, a character generates a class if it satisfies all of the following:
1. It is in the first plane (the Basic Multilingual Plane).
2. Its category is Ll, Lu or Lt and it is its own LC character, or its category is Lo.
3. It is not in one of the following blocks: Superscripts and Subscripts, Letterlike Symbols, Alphabetic Presentation Forms, Halfwidth and Fullwidth Forms, CJK Unified Ideographs, CJK Unified Ideographs Extension A, Hangul Syllables.
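The selection rules above can be sketched in Java as follows. This is only an illustration, not the script that produced the attachment: the attachment was generated from UnicodeData.txt, whereas this sketch relies on the Unicode tables built into the JDK's `java.lang.Character`, so the results may differ with the Unicode version of the runtime. The class and method names are hypothetical.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the class-generation rules, using the JDK's
// Unicode data instead of UnicodeData.txt (an assumption of this example).
final class HyphenationClassSketch {
    // Rule 3: blocks whose characters never generate a class.
    private static final Set<UnicodeBlock> EXCLUDED = Set.of(
            UnicodeBlock.SUPERSCRIPTS_AND_SUBSCRIPTS,
            UnicodeBlock.LETTERLIKE_SYMBOLS,
            UnicodeBlock.ALPHABETIC_PRESENTATION_FORMS,
            UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            UnicodeBlock.HANGUL_SYLLABLES);

    // Rules 2 and 3 for a single code point.
    static boolean generatesClass(int cp) {
        UnicodeBlock block = UnicodeBlock.of(cp);
        if (block == null || EXCLUDED.contains(block)) {
            return false;
        }
        int type = Character.getType(cp);
        if (type == Character.OTHER_LETTER) {
            return true; // category Lo
        }
        boolean cased = type == Character.LOWERCASE_LETTER
                || type == Character.UPPERCASE_LETTER
                || type == Character.TITLECASE_LETTER;
        // Category Ll, Lu or Lt, and the character is its own lowercase form.
        return cased && Character.toLowerCase(cp) == cp;
    }

    // The class: the character itself, then its UC and TC forms if distinct.
    static String classFor(int cp) {
        StringBuilder sb = new StringBuilder().appendCodePoint(cp);
        int upper = Character.toUpperCase(cp);
        int title = Character.toTitleCase(cp);
        if (upper != cp) {
            sb.appendCodePoint(upper);
        }
        if (title != cp && title != upper) {
            sb.appendCodePoint(title);
        }
        return sb.toString();
    }

    // Rule 1: only the first plane, U+0000 to U+FFFF.
    static List<String> allClasses() {
        List<String> classes = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (generatesClass(cp)) {
                classes.add(classFor(cp));
            }
        }
        return classes;
    }
}
```

Under these rules `a` generates the class `aA`, while `A` generates nothing because it is not its own lowercase form. A character such as U+01C6 (dž) produces a three-member class, since its uppercase (U+01C4) and title case (U+01C5) forms differ.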

We can do one of two things: add these classes to each hyphenation file, or add them to the code that generates the hyphenation trie, preferably reading them from a separate file. I prefer the latter option. What do you think?
Comment 1 Simon Pepping 2009-07-31 12:29:53 UTC
Of course, I can also XInclude the classes into the pattern files. The question is whether the classes are more a property of FOP or of the hyphenation pattern files.
Comment 2 Vincent Hennebert 2009-08-06 02:45:16 UTC
Hi Simon,

(In reply to comment #0)
> Created an attachment (id=24069)
> classes for hyphenation, generated from UnicodeData.txt
<snip/> 
> We can do two things: Add these classes to each hyphenation file, or add them
> to the code that generates the hyphenation trie, preferably to be read from a
> separate file. I prefer the latter option. What do you think?

Not that I know much about hyphenation, but the latter option also looks preferable to me. I'd say that the classes belong in FOP.

Vincent
Comment 3 Simon Pepping 2010-05-25 05:26:21 UTC
Fixed in revision 805561 and some later revisions, finally in revision 810896.
Comment 4 Glenn Adams 2012-04-01 07:02:12 UTC
batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed