The patch adds to FOP the ability to handle CJK line breaking. It uses java.text.BreakIterator to determine all possible line-break points within space-separated words. Between two CJK ideographs, a KnuthGlue is inserted that allows 1pt of shrink. Between a CJK ideograph and a Latin letter, a KnuthGlue is inserted that suggests half a word space. In addition, CJK characters whose glyph occupies only half the width are made shrinkable to half width. The patch is based on revision 307247. See attachment.
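For readers unfamiliar with java.text.BreakIterator, here is a minimal sketch of how its line instance reports break opportunities. It is not code from the patch: the class name LineBreakSketch and the method lineBreaks are hypothetical, and the exact offsets reported for CJK text depend on the rules shipped with the JDK in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FOP or of the attached patch.
final class LineBreakSketch {

    /** Returns every offset after position 0 at which the JDK line iterator allows a break. */
    static List<Integer> lineBreaks(String text) {
        BreakIterator it = BreakIterator.getLineInstance();
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        it.first(); // offset 0 is always reported as a boundary, so skip it
        for (int pos = it.next(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Latin text: a break opportunity follows the space.
        System.out.println(lineBreaks("hello world"));
        // Mixed Latin and CJK text ("FOP" followed by four ideographs).
        System.out.println(lineBreaks("FOP \u4E2D\u6587\u6392\u7248"));
    }
}
```

The last offset returned is always the end of the text, so a caller that only wants interior break points should drop it.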
Created attachment 16627: Patch for TextLayoutManager.java
Thank you very much for submitting this patch. Would it be possible for you to also add a sample fo file demonstrating the effect of the patch? I believe there is little experience within the current group of committers with respect to non-Western scripts. Therefore I would appreciate it if you could briefly summarise the line breaking rules for CJK for the uninitiated Westerner. Thanks again for your interest in and support of FOP.
Every CJK ideograph is a LETTER. A sentence is a sequence of letters and punctuation marks with NO space required between them.

Line breaking rules:

A break is feasible before and after any CJK letter, except where the following punctuation rules apply.

Line breaking is NOT allowed after a punctuation mark that belongs to:
  Unicode General Category Pi [Punctuation, Initial quote]
  Unicode General Category Ps [Punctuation, Open]

Line breaking is NOT allowed before a punctuation mark that belongs to:
  Unicode General Category Pf [Punctuation, Final quote]
  Unicode General Category Pe [Punctuation, Close]
  Unicode General Category Po [Punctuation, Other] and Line Break IS [Infix Separator (Numeric)]
  Unicode General Category Po [Punctuation, Other] and Line Break CL [Closing Punctuation]
  Unicode General Category Po [Punctuation, Other] and Line Break NS [Non Starter]
  Unicode General Category Po [Punctuation, Other] and Line Break EX [Exclamation/Interrogation]

The rules are trivial, so it is better to let java.text.BreakIterator take care of them.

Typesetting Adjustment: (for Chinese; Japanese and Korean are uncertain)

Terms:
  HALFWIDTH: most Western characters; the width is about half of the height.
  FULLWIDTH: most CJK characters; the width is about equal to the height.
  LEFTHALFGLYPH: some CJK punctuation whose glyph occupies only the left half of the width, such as \u300B and \u201D.
  RIGHTHALFGLYPH: some CJK punctuation whose glyph occupies only the right half of the width, such as \u300A and \u201C.
  NOTE: some characters, for example \u201C, are HALFWIDTH in Western fonts. In that case the glyph occupies the whole width and is not considered a HALFGLYPH.

Adjustment Rules:

R1. If a RIGHTHALFGLYPH is followed by a RIGHTHALFGLYPH (IDEOGRAPHIC_FULL_STOP, RIGHT_DOUBLE_QUOTATION_MARK), the first one should be compressed to its minimum, half width (opt=half_width, shrink=0). A LEFTHALFGLYPH followed by a LEFTHALFGLYPH would not occur in normal text and is not considered.

R2. Between a FULLWIDTH letter and a HALFWIDTH letter or digit, a small space is advised. The small space is set to (opt=half_space, shrink=half_space).

R3. Between two FULLWIDTH letters, a small shrink is allowed. 1/16 of the width would be good, to avoid overlapping glyphs. This feature should be configurable.

R4. When a line is stretched, every feasible breaking point is stretchable. The stretch of a glue between two CJK letters should be smaller than that of the glue following a punctuation mark. The ratio has not been considered yet. Currently the TextLayoutManager does not consider this even for Western-style layout (space between words versus space following a sentence-ending period).

R5. When a line shrinks, first compress HALFGLYPHs; a HALFGLYPH can be compressed to half width. Second, compress the small space between a FULLWIDTH letter and a HALFWIDTH letter or digit (R2). Third, shrink the letter space (R3). This shrink priority is hard to implement without modifying the Knuth algorithm.
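To make some of these rules concrete, here is a minimal sketch covering only the General Category part of the line breaking rules plus adjustment rules R2 and R3. It is not the patch's code and does not use FOP's KnuthGlue: the Glue class, all method names, and the use of millipoints are hypothetical, and FULLWIDTH is approximated by a Unicode script test. The Po rules are left out because they depend on the Line Break property, which java.lang.Character does not expose.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical illustration only; names and units are assumptions.
final class CjkRulesSketch {

    /** Stand-in for a Knuth glue: optimal width and shrink, in millipoints. */
    static final class Glue {
        final int opt;
        final int shrink;

        Glue(int opt, int shrink) {
            this.opt = opt;
            this.shrink = shrink;
        }
    }

    /** Approximation (assumption): a FULLWIDTH letter is a Han, Hiragana, Katakana or Hangul code point. */
    static boolean isFullWidthLetter(int cp) {
        UnicodeScript s = UnicodeScript.of(cp);
        return s == UnicodeScript.HAN || s == UnicodeScript.HIRAGANA
                || s == UnicodeScript.KATAKANA || s == UnicodeScript.HANGUL;
    }

    /** A HALFWIDTH letter or digit: any other letter or digit. */
    static boolean isHalfWidthLetterOrDigit(int cp) {
        return !isFullWidthLetter(cp) && Character.isLetterOrDigit(cp);
    }

    /** False if the Pi/Ps (no break after) or Pf/Pe (no break before) rule forbids a break here. */
    static boolean punctuationAllowsBreak(int before, int after) {
        int tb = Character.getType(before);
        int ta = Character.getType(after);
        if (tb == Character.INITIAL_QUOTE_PUNCTUATION || tb == Character.START_PUNCTUATION) {
            return false;
        }
        if (ta == Character.FINAL_QUOTE_PUNCTUATION || ta == Character.END_PUNCTUATION) {
            return false;
        }
        return true;
    }

    /** Glue for R2 and R3, or null when neither rule applies. */
    static Glue glueBetween(int left, int right, int halfSpaceMpt, int fullWidthMpt) {
        boolean leftFull = isFullWidthLetter(left);
        boolean rightFull = isFullWidthLetter(right);
        if (leftFull && rightFull) {
            return new Glue(0, fullWidthMpt / 16); // R3: no extra width, small shrink
        }
        if ((leftFull && isHalfWidthLetterOrDigit(right))
                || (rightFull && isHalfWidthLetterOrDigit(left))) {
            return new Glue(halfSpaceMpt, halfSpaceMpt); // R2: opt=half_space, shrink=half_space
        }
        return null;
    }
}
```

For example, glueBetween(0x4E2D, 'A', 1500, 12000) yields a glue of 1500 mpt that can shrink to zero, while two adjacent ideographs get a zero-width glue with 750 mpt of shrink. R1, R4 and R5 are not modelled here.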
Created attachment 16651: Chinese example, includes fo, pdf before and after patch
Created attachment 16694: new patch

This new patch fulfills all the rules mentioned in Comment #3 except Typesetting Adjustment R3 and the shrink priority of R5. It uses java.text.BreakIterator.getLineInstance(locale) to decide the line breaking points, and BreakIterator.getWordInstance(locale) to decide word boundaries for implementing Typesetting Adjustment R1. Line breaking now works well.

* ISSUE, HELP NEEDED *
The widths of all the KnuthGlues inserted for Typesetting Adjustment appear at the end of the line instead of at the positions where they were inserted. I think I have made a mistake with KnuthGlue and AreaInfo. I am trying to track it down, but I hope someone can give me some hints. Is it caused by the KnuthGlue/AreaInfo having no corresponding string?
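As a point of reference for the word-boundary step, here is a minimal sketch of splitting text with BreakIterator.getWordInstance(locale). It is not taken from the patch: the class WordBoundarySketch and the method segments are hypothetical names, and how CJK runs are segmented depends on the rules shipped with the JDK in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of FOP or of the attached patch.
final class WordBoundarySketch {

    /** Splits text at every boundary reported by the locale's word iterator. */
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Spaces and punctuation are returned as segments of their own.
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segments("Hello, world", Locale.ENGLISH));
        System.out.println(segments("\u4E2D\u6587\u3002\u201D", Locale.CHINESE));
    }
}
```

Because the segments cover the text without gaps, concatenating them always reproduces the input, which makes it straightforward to map each segment back to a character range.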
I don't have the font you use in your example, so I cannot test it directly to see what happens. Anyway, if you could attach the list of elements created, it would be easier to see whether they are correct. For example, you could put this method in LineLayoutManager.java

...
    // Debug helper: prints each Knuth element with its index (needs java.util.ListIterator).
    private void outputElementList(Paragraph par) {
        System.out.println("");
        System.out.println("paragraph start");
        ListIterator iter = par.listIterator();
        KnuthElement el;
        while (iter.hasNext()) {
            el = (KnuthElement) iter.next();
            if (el.isBox()) {
                System.out.println(iter.previousIndex() + ") box w=" + el.getW());
            } else if (el.isGlue()) {
                // w = width, y = stretch, z = shrink
                System.out.println(iter.previousIndex() + ") glue w=" + el.getW()
                        + " y=" + el.getY() + " z=" + el.getZ());
            } else {
                System.out.println(iter.previousIndex() + ") penalty w=" + el.getW()
                        + " p=" + el.getP());
            }
        }
        System.out.println("paragraph end");
    }
...

call it in createLineBreaks() just before findOptimalBreakingPoints(), with seq as parameter, and attach the resulting output file.

Regards
Luca
The problem you mention (all the adjustment space appears at the end of the line; in other words, the line is not justified) is probably due to bug 36238.

Regards
Luca
I had a look at the patch and think it would be really great to get a proper Unicode / international line breaking algorithm into FOP. Unfortunately there seems to be a bug in the code which breaks the current layout engine test suite: the test text-decoration_1.xml fails. I think it has to do with text like "/over/under/through", where FOP's handling of "/" as a breaking character somehow conflicts with your work.

I am also uncomfortable with the triple breaking being done. First FOP natively breaks the text into chunks based on sp, nbsp, lf, and the like. Then you take those chunks and break them using a BreakIterator in line mode. After that, those pieces are broken again using another BreakIterator in word mode. That seems to be a lot of iterations over those characters. Would it be possible and sensible to have a single breaker which does all three things?

Maybe we should simplify the problem and concentrate on line breaking first, leaving the typesetting fine tuning to a subsequent patch once we are happy with the line breaking. What do you think?
It seems that the new method createElementsForLineBoundary() is called and appends elements even if there are no CJK characters, and I think this should not happen. When I tried applying the patch some days ago, the test cases concerning hyphenation failed too: the output had both missing and repeated pieces of text.
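One way to avoid touching non-CJK text would be a cheap guard before the CJK-specific path. The sketch below is only an illustration under assumptions: the class CjkGuardSketch and the method containsCjk are hypothetical, and "CJK" is approximated by four Unicode scripts, so CJK punctuation (whose script is Common) is not detected on its own.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical guard, not part of FOP or of the attached patch.
final class CjkGuardSketch {

    /** Returns true if the text contains at least one Han, Hiragana, Katakana or Hangul code point. */
    static boolean containsCjk(CharSequence text) {
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN || script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA || script == UnicodeScript.HANGUL) {
                return true;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one code point
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsCjk("/over/under/through"));
        System.out.println(containsCjk("FOP \u4E2D\u6587"));
    }
}
```

A caller could then skip the extra element generation whenever containsCjk returns false, leaving existing Western-text behaviour unchanged.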
Created attachment 16846: line break patch

This is a line break patch without the typesetting fine tuning. It is still based on BreakIterator and passes the test cases. Sorry for my previous buggy patch. Until the new Unicode line breaking algorithm is available, this patch makes it possible to test FOP with CJK characters.
Increasing priority for bugs with a patch.