Bug 36977 - [PATCH] TextLayoutManager CJK line break
Summary: [PATCH] TextLayoutManager CJK line break
Status: NEW
Alias: None
Product: Fop - Now in Jira
Classification: Unclassified
Component: page-master/layout
Version: trunk
Hardware: Other other
Importance: P2 normal
Target Milestone: ---
Assignee: fop-dev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-10-09 11:19 UTC by Jingjing Lee
Modified: 2012-04-18 07:07 UTC



Attachments
Patch for TextLayoutManager.java (8.36 KB, patch)
2005-10-09 11:21 UTC, Jingjing Lee
chinese example, includes fo, pdf before and after patch (26.82 KB, application/zip)
2005-10-11 07:24 UTC, Jingjing Lee
new patch (13.86 KB, patch)
2005-10-14 04:55 UTC, Jingjing Lee
line break patch (11.37 KB, patch)
2005-11-01 06:49 UTC, Jingjing Lee

Description Jingjing Lee 2005-10-09 11:19:09 UTC
This patch adds the ability to handle CJK line breaking to FOP.
It uses java.text.BreakIterator to determine all feasible line break points in
space-separated words. Between two CJK ideographs, a KnuthGlue is inserted that
allows 1pt of shrink. Between a CJK ideograph and a Latin letter, a KnuthGlue is
inserted that suggests half a word space. Some CJK characters whose glyph
occupies only half the width are made shrinkable by half their width.

The patch is based on revision 307247. See the attachment.
Comment 1 Jingjing Lee 2005-10-09 11:21:49 UTC
Created attachment 16627 [details]
Patch for TextLayoutManager.java
Comment 2 Manuel Mall 2005-10-09 12:35:53 UTC
Thank you very much for submitting this patch. Would it be possible for you to
also add a sample fo file demonstrating the effect of the patch? I believe
there is little experience within the current group of committers with respect
to non-Western scripts. Therefore I would appreciate it if you could briefly
summarise for the uninitiated Westerner the line breaking rules for CJK.

Thanks again for your interest in and support of FOP.
Comment 3 Jingjing Lee 2005-10-11 06:59:22 UTC
Every CJK ideograph is a LETTER. A sentence is a sequence of letters and
punctuation marks with NO space required between them.

Line breaking rules:
A feasible break point exists before and after every CJK letter, except where
the following punctuation rules apply.

Line breaking is NOT allowed after punctuation that belongs to:
Unicode General Category Pi [Punctuation, Initial quote]
Unicode General Category Ps [Punctuation, Open]

Line breaking is NOT allowed before punctuation that belongs to:
Unicode General Category Pf [Punctuation, Final quote]
Unicode General Category Pe [Punctuation, Close]
Unicode General Category Po [Punctuation, Other] and Line Break IS [Infix Separator (Numeric)]
Unicode General Category Po [Punctuation, Other] and Line Break CL [Closing Punctuation]
Unicode General Category Po [Punctuation, Other] and Line Break NS [Non Starter]
Unicode General Category Po [Punctuation, Other] and Line Break EX [Exclamation/Interrogation]

The rules are trivial, so it is better to let java.text.BreakIterator take care of them.
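
As an illustration of the General Category part of these rules, here is a minimal sketch using the standard java.lang.Character API. The class and method names are hypothetical and not taken from the patch. It only covers the Pi/Ps and Pf/Pe cases; the Po cases depend on the Unicode Line Break class, which java.lang.Character does not expose, so they are deliberately left out.

```java
// Hypothetical helper, not part of FOP or of the attached patch.
final class CjkBreakRules {

    private CjkBreakRules() { }

    /** True if no line break may occur directly after this code point (Pi or Ps). */
    static boolean forbidsBreakAfter(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.INITIAL_QUOTE_PUNCTUATION   // Pi
            || type == Character.START_PUNCTUATION;          // Ps
    }

    /**
     * True if no line break may occur directly before this code point (Pf or Pe).
     * The Po + Line Break class cases (IS, CL, NS, EX) are NOT handled here.
     */
    static boolean forbidsBreakBefore(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.FINAL_QUOTE_PUNCTUATION     // Pf
            || type == Character.END_PUNCTUATION;            // Pe
    }
}
```

For example, forbidsBreakAfter(0x300A) and forbidsBreakBefore(0x300B) both hold for the double angle brackets, while an ordinary ideograph such as U+4E2D is rejected by both checks.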


Typesetting Adjustment: (for Chinese; Japanese and Korean are uncertain)
Terms:
	HALFWIDTH: most Western characters; the width is about half the height.
	FULLWIDTH: most CJK characters; the width is about equal to the height.
	LEFTHALFGLYPH: some CJK punctuation whose glyph occupies only the left half of
the width, such as \u300B \u201D
	RIGHTHALFGLYPH: some CJK punctuation whose glyph occupies only the right half of
the width, such as \u300A \u201C
NOTE: some characters, for example \u201C, are HALFWIDTH in Western fonts. In that
case the glyph occupies the whole width and is not considered a HALFGLYPH.

Adjustment Rules:
R1. If a RIGHTHALFGLYPH is followed by a RIGHTHALFGLYPH (IDEOGRAPHIC_FULL_STOP,
RIGHT_DOUBLE_QUOTATION_MARK), the first one should be compressed to its
minimum, half width (opt=half_width, shrink=0).
A LEFTHALFGLYPH followed by a LEFTHALFGLYPH would not occur in normal text, so it
is not considered.

R2. Between a FULLWIDTH letter and a HALFWIDTH letter or digit, a small space is
advised. The small space is set to (opt=half_space, shrink=half_space).

R3. Between two FULLWIDTH letters, a small shrink is allowed. 1/16 of the width
would be good to avoid glyph overlap. This feature should be configurable.

R4. When a line stretches, every feasible break point is stretchable. The stretch
of a glue between two CJK letters should be smaller than that of a glue following
a punctuation mark. The ratio is not considered yet. Currently, the
TextLayoutManager does not consider it even for Western-style layout (space
between words versus space following a sentence-ending period).

R5. When a line shrinks, first compress HALFGLYPHs; a HALFGLYPH can be compressed
to half width. Second, compress the small space between a FULLWIDTH letter and a
HALFWIDTH letter or digit (R2). Third, shrink the letter space (R3).
This shrink priority is hard to implement without modifying the Knuth algorithm.
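
The numeric part of R1 to R3 can be sketched as follows. This is only an illustration: the Glue class is a hypothetical stand-in and not FOP's KnuthGlue, all widths are assumed to be in millipoints, and the stretch is set to 0 because the rules above do not specify it (R4 leaves the ratio open).

```java
// Hypothetical sketch of the R1-R3 values; none of these names exist in FOP.
final class CjkAdjustmentSketch {

    /** Minimal glue-like triple: optimum width, stretch and shrink (millipoints). */
    static final class Glue {
        final int opt;
        final int stretch;
        final int shrink;

        Glue(int opt, int stretch, int shrink) {
            this.opt = opt;
            this.stretch = stretch;
            this.shrink = shrink;
        }
    }

    private CjkAdjustmentSketch() { }

    /** R1: width of the first of two adjacent half glyphs (opt=half_width, shrink=0). */
    static int compressedHalfGlyphWidth(int fullWidth) {
        return fullWidth / 2;
    }

    /** R2: small space between FULLWIDTH and HALFWIDTH (opt=half_space, shrink=half_space). */
    static Glue fullToHalfWidthGlue(int wordSpace) {
        int halfSpace = wordSpace / 2;
        return new Glue(halfSpace, 0, halfSpace);
    }

    /** R3: zero-width glue between two FULLWIDTH letters that may shrink by 1/16 of the width. */
    static Glue fullToFullWidthGlue(int fullWidth) {
        return new Glue(0, 0, fullWidth / 16);
    }
}
```

With a 12pt font (12000 millipoints per FULLWIDTH letter) and a 3000 millipoint word space, R2 yields a glue of 1500 that can shrink to 0, and R3 allows 750 millipoints of shrink between two ideographs.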
Comment 4 Jingjing Lee 2005-10-11 07:24:46 UTC
Created attachment 16651 [details]
chinese example, includes fo, pdf before and after patch
Comment 5 Jingjing Lee 2005-10-14 04:55:10 UTC
Created attachment 16694 [details]
new patch

This new patch fulfills all the rules mentioned in Comment #3 except Typesetting
Adjustment R3 and the shrink priority of R5.
It uses java.text.BreakIterator.getLineInstance(locale) to determine line break
points, and BreakIterator.getWordInstance(locale) to determine word boundaries
for implementing Typesetting Adjustment R1.
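
For readers unfamiliar with the two iterators, a minimal stand-alone sketch of how their boundaries can be collected is shown here. The class and method names are hypothetical and this is not the code from the patch; it only uses the standard java.text.BreakIterator API, and the exact boundaries reported for CJK text depend on the JDK in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of FOP.
final class BreakPointDemo {

    private BreakPointDemo() { }

    /** Collects every boundary offset the iterator reports for the text, from 0 to text.length(). */
    static List<Integer> boundaries(BreakIterator iterator, String text) {
        iterator.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = iterator.first(); pos != BreakIterator.DONE; pos = iterator.next()) {
            result.add(pos);
        }
        return result;
    }

    /** Offsets at which a line break is permitted. */
    static List<Integer> lineBreaks(String text, Locale locale) {
        return boundaries(BreakIterator.getLineInstance(locale), text);
    }

    /** Offsets of word boundaries. */
    static List<Integer> wordBoundaries(String text, Locale locale) {
        return boundaries(BreakIterator.getWordInstance(locale), text);
    }

    public static void main(String[] args) {
        String mixed = "FOP \u4e2d\u6587\u6392\u7248";
        System.out.println(lineBreaks(mixed, Locale.CHINESE));
        System.out.println(wordBoundaries(mixed, Locale.CHINESE));
    }
}
```

Both lists always start at 0 and end at the text length, so a caller that wants only interior break opportunities has to drop the first and last entries.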

Now, line breaking works well. 

* ISSUE: NEED HELP *
The widths of all KnuthGlues inserted for Typesetting Adjustment appear at the
line end instead of at their inserted positions. I think I made a mistake with
KnuthGlue and AreaInfo. I'm trying to dig it out; however, I hope someone can
give me some hints. Is it caused by the KnuthGlue/AreaInfo having no
corresponding string?
Comment 6 Luca Furini 2005-10-14 16:35:15 UTC
I don't have the font you use in your example, so I cannot directly test it to
see what happens.

Anyway, if you could attach the list of elements created it could be easier to
see if they are correct.

For example, you could put this method in LineLayoutManager.java ...

// Debug helper: prints every Knuth element of a paragraph with its index.
// Requires java.util.ListIterator to be imported in LineLayoutManager.java.
private void outputElementList(Paragraph par) {
    System.out.println("");
    System.out.println("paragraph start");
    ListIterator iter = par.listIterator();
    KnuthElement el;
    while (iter.hasNext()) {
        el = (KnuthElement) iter.next();
        // previousIndex() is the index of the element just returned by next()
        if (el.isBox()) {
            System.out.println(iter.previousIndex() + ") box w=" + el.getW());
        } else if (el.isGlue()) {
            System.out.println(iter.previousIndex() + ") glue w=" + el.getW()
                    + " y=" + el.getY() + " z=" + el.getZ());
        } else {
            System.out.println(iter.previousIndex() + ") penalty w=" + el.getW()
                    + " p=" + el.getP());
        }
    }
    System.out.println("paragraph end");
}

... call it in createLineBreaks() just before findOptimalBreakingPoints(), with
seq as parameter, and attach the resulting output file.

Regards
    Luca
Comment 7 Luca Furini 2005-10-18 12:40:33 UTC
The problem you mention (all the adjustment space appears at the end of the
line, in other words the line is not justified) is probably due to bug 36238.

Regards
    Luca
Comment 8 Manuel Mall 2005-10-25 15:13:17 UTC
I had a look at the patch and think it would be really great to get a proper
Unicode / international line breaking algorithm into FOP.

Unfortunately there seems to be a bug in the code which breaks the current
layout engine test suite. The test text-decoration_1.xml fails. I think it has
to do with text like "/over/under/through": FOP's handling of "/" as a
breaking char somehow conflicts with your work.

I am also uncomfortable with the triple breaking being done: First FOP natively
breaks text into chunks based on sp, nbsp, lf, and the like. Then you take
those chunks and break them using a BreakIterator in line mode. After that
those pieces are broken again using another BreakIterator in word mode. That
seems to be a lot of iterations over those characters.

Would it be possible / sensible to have a single breaker which does all three
things?

Maybe we should simplify the problem and concentrate on line breaking first,
leaving the typesetting fine-tuning to a subsequent patch once we are happy
with the line breaking?

What do you think?
Comment 9 Luca Furini 2005-10-26 12:31:33 UTC
It seems that the new method createElementsForLineBoundary() is called and
appends elements even if there are no CJK characters, and I think this should
not happen.
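
One possible guard, sketched under assumptions: the class below is hypothetical and not part of FOP, and it relies on Character.UnicodeScript and CharSequence.codePoints(), which need a much newer Java version than FOP targeted at the time of this report. It treats Han, Hiragana, Katakana and Hangul as CJK; CJK punctuation belongs to the COMMON script and is therefore not detected.

```java
// Hypothetical helper, not part of FOP; requires Java 8 or later.
final class CjkDetector {

    private CjkDetector() { }

    /** True if the code point belongs to one of the scripts treated as CJK here. */
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    /** True if the text contains at least one CJK code point. */
    static boolean containsCjk(CharSequence text) {
        return text.codePoints().anyMatch(CjkDetector::isCjk);
    }
}
```

A caller could skip the CJK-specific element creation whenever containsCjk(text) is false, so that purely Western text such as "/over/under/through" keeps its existing element list.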

When I tried applying the patch some days ago, the testcases concerning
hyphenation failed too: the output had both missing and repeated pieces of text. 
Comment 10 Jingjing Lee 2005-11-01 06:49:29 UTC
Created attachment 16846 [details]
line break patch

This is a line break patch without the typesetting fine-tuning.
It is still based on BreakIterator and passes the test cases.
Sorry for my previous buggy patch.
Until the new Unicode Line Breaking algorithm is available,
this patch makes it possible to test FOP with CJK characters.
Comment 11 Glenn Adams 2012-04-11 03:20:39 UTC
increase priority for bugs with a patch