Ok, so here is the patch I was talking about in the fop-dev mailing list. I'm going to attach: - the patch concerning existing files - 5 new .java files (diff did not consider them) - a few test fo files showing some properties (text-indent, text-align-last, word-spacing & letter-spacing) - an fo file with some long paragraphs, which can be used to make space width comparison; for example, HEAD and xep have very large spaces in some lines of the first paragraph. I try to summarize briefly the algorithm, but I'm not so sure I can be both clear and exhaustive :-) The first step in the algorithm is converting the paragraph into a sequence of elements; there are 3 kind of elements: - "box" elements, representing words or whatever else; a box has a width - "glue" elements, representing adjustable spaces between boxes; a glue has an optimal width, a potential stretch and a potential shrink, i.e. its length is in the range [width - shrink, width + stretch] - "penalty" items represents positions in which a line could be broken; a penalty has a width (i.e. the "-" width when hyphenating), a penalty value and a boolean; the penalty value is a kind of "aesthetic cost", and it could be a finite number, +infinite (where there can't be a break) or -infinite (a forced break); penalties with value = 0 can be omitted For example, the text "life is beautiful" will generate the sequence: box - glue - box - glue - box - penalty - box - penalty - box ^ ^ ^ ^ ^ ^ ^ | | | | | | | life is beau - ti - ful Then, the algorithm finds the breaking points: For each element in the paragraph, it verifies if this element is a "feasible break", i.e. there is a previous "active" feasible break such that the elements between the two break can fill a line, using the available stretch and shrink. If the element is a feasible break, the algorithm computes a "score" depending on the needed adjustment and the penalty (the lower is the score, the better is the break), and create a structure representing this element as an active feasible break. Note that the number of active feasible break at each moment is quite small, as feasible breaks are deactivated if they are too close to the new one. Some implementation notes: 1) In order to minimize changes in other files, the LLM interacts with the TLM calling brand new methods that I added to the LayoutManager interface, providing null implementation in the AbstractLayoutManager. I also added a KnuthPossPosIter, which is almost a clone of BreakPossPosIter. I think a much more elegant solution would be having the LLM call getNextBreakPoss, and casting the returned object to KnuthElement. Maybe BreakPoss and KnuthElement could be subclasses of a same class, abstract, containing only a Position object and the method to access it: BreakPoss (same name, but a different class) | --------------- | | KnuthElement TheClassPreviouslyKnownAsBreakPoss All LM call getNextBreakPoss, but the LLM casts to KnuthElement and the other LM cast to "TheClassPreviouslyKnownAsBreakPoss" 2) The letter space is at the moment set to .min = .opt and .max = .opt as there are a few problems with adjustable letter space (Knuth's original algorithm does not handle letter space); I also have some problems in counting the letter spaces in a hyphenated word. 3) The LLM tries the first time to find breaking point without word hyphenation; only if it doesn't find a set of break points, it calls again the algorithm after having hyphenated all words, so this patch includes the ones concerning hyphenation (bug 27773) and hyphenation of word with punctuation marks (bug 28431). I tried to use the old information (about the whole words) as much as possible, but I don't know whether this saves some time or it is just an unneeded complication. 4) Concerning word and letter spacing, one problem is that at the moment I see no way to decide whether the fo property letter-spacing (or word-spacing) was set to 0 (and fop should respect this value), or it wasn't set and 0 is its default value (and fop could modify it in order to justify). Comments, suggestions, etc. are most welcome! Regards Luca
Created attachment 11602 [details] patch to existing files
Created attachment 11603 [details] KnuthElement.java
Created attachment 11604 [details] KnuthBox.java
Created attachment 11605 [details] KnuthGlue.java
Created attachment 11606 [details] KnuthPenalty.java
Created attachment 11607 [details] KnuthPossPosIter.java
Created attachment 11608 [details] fo test file (text-indent)
Created attachment 11609 [details] fo test file (text-align-last)
Created attachment 11610 [details] fo test file (word-spacing and letter-spacing)
Created attachment 11611 [details] fo test file (long paragraphs of text)
Created attachment 11625 [details] LineLayoutManager.java
Created attachment 11626 [details] TextLayoutManager.java
Hi all At long last, I have finished the patch implementing Knuth's line breaking algorithm; it took me more than I expected, mainly because of a long sequence of hw and sw troubles ... Murphy's laws are not something to laugh at! :-) I have worked on [Line, Text, InlineStacking, LeafNode]LM, so the algorithm should work well with any fo file containing text, leaders, characters, inlines and the other formatting objects handled by a LeafNodeLM (external graphics, pagenumbers and citations). The general idea of Knuth algorithm is: try to find breaking points without hyphenating words if this fails hyphenate all words try again The "hyphenate all words" phase could be time-expansive, so this step is performed trying to use as much as possible the information already known, and to minimize the changes to the existing sequence of elements. The old sequence is used to collect word fragments, and elements are replaced only if the LM which created them has something to change. So, "hyphenate all words" means: scan the old sequence once: collect word fragments hyphenate word scan the old sequence once more: if the LM which returned this element has changed something replace all elements returned by this LM These are the new methods added; at the moment, I added them to the LayoutManager interface, but maybe it could be better to create a new interface implementd only by LM returning inline areas. + getNextKnuthElement() This is used instead of getNextBreakPoss(). The next step (I have already started working on it) would be to use the same method all LM use. + addALetterSpaceTo() The low-level LMs (TLMs, LNLMs) have only a partial view of the text, and therefore cannot know the exact number of letter spaces, while the LineLM has a full view. If a TLM's text is "Tex", it can only suppose it has 2 letter spaces; if the following formatting object is a character "t", the LineLM tells the TLM to add a letter space, as the "x" is not the last letter of the word. + getWordChars() This is not a new method, it just has different parameters; text is collected from fo:characters too. + hyphenate() The TLM does not apply the changes to vecAreaInfo immediately, otherwise the existing Position objects stored in the old sequence couldn't be used any more. The LeafNodeLM returns a single area, so it can apply changes immediatly. + applyChanges() This method tells the TLM to apply the changes to vecAreaInfo; all LM returns true if something is changed or false otherwise, so the LLM knows whether it has to replace the old elements or not. + getChangedKnuthElement() This is used by the LLM to obtain the new elements. + getWordSpaceIPD() This is used by the LLM to ask for the word space dimension; the LLM needs it to center text. A few details to fix: - word spacing and letter spacing are now fully implemented, they can both have MinOptMax values; but I am still thinking about how to differentiate a user-defined zero value from a default zero value ... - Leaders with leader-pattern = "rule" or "space" work well; with "dots" the space left is right, but the dots don't fill it properly. Leaders with leader- pattern="use-content" don't work, as the ContentLayoutManager has at the moment only a null implementation of the method getNextKnuthElement. There is also a minor bug concerning (IMO) white space handling: if there white space both before and after the leader, the latter one is removed, so instead of word ______ word the output shows word ______word - with the other fo elements (fo:externalgraphic, fo:page-number and fo:page- number-citation) the LeafNodeLM behave exactly the same way as with the old code, i.e. a fo:page-number-citation generates a "? ". - text-align-last is partially implemented; text-align-last = "justify" works only if text-align = "justify" too; this is because Knuth's algorithm doesn't provide for a different alignment for the last line. I'm going to attach: - the patch to existing files and new files - a test fo file
Created attachment 12308 [details] patch (second edition)
Created attachment 12309 [details] test fo file
I just realized that in the patch there is a line commented out, while it shouldn't be. It's in the LineLayoutManager, in the method getNextBreakPoss(); the code at the moment is: ... if (true) { //if ((iBPcount = findBreakingPoints(currPar, context.getStackLimit().opt, maxAdjustment)) == 0) { ... The commented line should be uncommented, and the "if (true) {" removed. Anyway, the program works all the same. Sorry for the oversight! Luca
Thanks, you will need to add the comments in this patch to the *code*, people will not always have the benefit of this Bugzilla entry when looking at it. This is extremely important, as layout is very complex. Also, is this Knuth algorithm copyrighted? Where did you get it from? It is rare that we have classes named after authors. Glen
Thanks again for you work--we have very few who can work on layout. Apparently, Dr. Knuth wouldn't seem to mind using his algorithms: http://lpf.ai.mit.edu/Patents/knuth-to-pto.txt
Luca, I also thank you for your time and commitment. Grazie mille^H^H^H^H^H un miliardo! I'm sure it is very much appreciated by the other COMMITTERs, as well as the throngs who will benefit from your time and energy in the future. This is a very exciting addition to FOP, and I'm hoping it will help to simplify the code in other ways as well. It's really nice to have a multitude of people who 'capish' (grok) the inner workings of FOP. Web Maestro Clay
I'm going to attach the corrected patch, sorry again! Knuth's algorithm is described in the essay D. E. Knuth and M. F. Plass, "Breaking paragraphs into lines" and I found it in the book D. E. Knuth, "Digital typography", published by CSLI Publications Unfortunately, I couldn't find any link to an on-line version of this essay. As regards the names of the classes, they were mainly devised to detect quickly the new files among the others! So, it's not a problem for me to change them :-) Luca
Created attachment 12345 [details] patch file (third edition, minor oversight fixed)
Luca, Your patch applied cleanly to a checkout from CVS dated 1 August 2004, but I get build errors. Could it be that differences to HyphContext are not included in the patch? [javac] src/java/org/apache/fop/layoutmgr/LineLayoutManager.java:1081: cannot resolve symbol [javac] symbol : constructor HyphContext (int[],int) [javac] location: class org.apache.fop.layoutmgr.HyphContext [javac] return new HyphContext(hyph.getHyphenationPoints(), sbChars.length()); [javac] ^ [javac] src/java/org/apache/fop/layoutmgr/TextLayoutManager.java:425: cannot resolve symbol [javac] symbol : method isWordEnd () [javac] location: class org.apache.fop.layoutmgr.HyphContext [javac] newIPD.add(MinOptMax.multiply(letterSpaceIPD, (hc.isWordEnd()? [javac] ^ [javac] src/java/org/apache/fop/layoutmgr/TextLayoutManager.java:437: cannot resolve symbol [javac] symbol : method isWordEnd () [javac] location: class org.apache.fop.layoutmgr.HyphContext [javac] (short)0, (short)(hc.isWordEnd()? (iStopIndex - iStartIndex - 1): (iStopIndex - iStartIndex)), [javac] ^ [javac] 3 errors Simon
I'm going to attach an updated patch, including HyphContext (which I forgot to include in the previous versions, sorry) and a few changes to fix a couple of bugs. I used linux's diff between the modified files and the original ones (updated yesterday, 11 August); for some reasons (maybe I use some wrong options) wincvs's diff did not include new files and did not use the latest version of the original files, so finding lots of difference due to recent cvs commits. I hope I did not forget anything this time! :-) Luca
Created attachment 12400 [details] patch file (fourth edition, including HyphContext and bug fixes)
Hi Luca, I have problems with the results for text-align="start", "end" and "center" in your test FO file. The lines are too long, and the first line ends with a space. For start and end a paragraph is dropped. Herewith the area tree output for start/end and center: <block width="288000" ipd="288000" height="30800" props="border-start:(87,#ffff00,1000);break-after:8;border-end:(87,#ffff00,1000);border-after:(87,#ffff00,1000);border-before:(87,#ffff00,1000);break-before:8;"> <lineArea height="14400"> <text twsadjust="0" tlsadjust="0" props="font-size:12000;font-family:F1;color:#000000;">Poche corte parole molto corte, in modo che tutto vada bene e non ci siano guai. Qualche </text> </lineArea> <lineArea height="14400"> <text twsadjust="0" tlsadjust="0" props="font-size:12000;font-family:F1;color:#000000;">tra parola per fare tre righe.</text> </lineArea> </block> <block width="288000" ipd="288000" height="30800" props="border-start:(87,#0000ff,1000);break-after:8;border-end:(87,#0000ff,1000);border-after:(87,#0000ff,1000);border-before:(87,#0000ff,1000);break-before:8;"> <lineArea height="14400"> <text twsadjust="0" tlsadjust="0" props="font-size:12000;font-family:F1;color:#000000;">Poche corte parole molto corte, in modo che tutto vada bene e non </text> </lineArea> <lineArea height="14400"> <text twsadjust="0" tlsadjust="0" props="font-size:12000;font-family:F1;color:#000000;">ci siano guai. Qualche altra paro</text> <text twsadjust="0" tlsadjust="0" props="font-size:12000;font-family:F1;color:#000000;">la per fare tre righe.</text> </lineArea> </block> Regards, Simon
Oops, sorry again, Simon! In the code I used when creating the last patch there was an error affecting the TextLayoutManager.getChangedKnuthElements() method: two missing "break" inside a switch. Due to this error, the sequence of elements generated for each space (when text-align is center, start or end) is wrong, and some text "disappears" (I even got IndexOutOfBounds exceptions). Inserting these breaks is enough to make everything work: ... // ai refers to a space switch (alignment) { case CENTER : ... iReturnedIndex ++; break; /* this was missing */ case START : // fall through case END : ... iReturnedIndex ++; break; /* this was missing */ case JUSTIFY: ... } ... As you can see, in the last patch I changed the getNextKnuthElements() and getChangedKnuthElements() return type, so they now return a sequence of elements instead of a single one. This maybe reduces similarities between getNextKnuthElements() and getNextBreakPoss(), but I think it makes the code simpler and easier to understand. Maybe it would be even better to make them return the whole sequence, so that these methods are called once per LM. Now I'm working on the newly-created LMs, so next patch (which I think will be ready tomorrow) will apply to the latest code version and will include Finn Bock's changes. Regards, Luca
Hi again I have updated the patch, so it now includes the latest changes and the new LayoutManagers (CharacterLM and LeaderLM). I also fixed the text-align-last property, so it now works well even if text- align is not "justify". Regards, Luca
Created attachment 12488 [details] patch - existing files (fifth edition)
Created attachment 12489 [details] patch - new files (fifth edition)
The last patches do not apply without errors: - Last lines of patch files do not end with an newline. - Diff of area/inline/TextArea.java is incomplete. - Compile error in render/xml/XMLRenderer.java. I did manage to fix all problems.
Luca, This is a good piece of work again. Nevertheless, I did succeed in breaking the code. See the following test fo files. Nested inline and other LMs: The output contains errors, see the comments in the text. The errors occur when hyphenation is set to true. Justification: This is a test fo you submitted earlier. According to the text in the file the second block should be hyphenated; it is not. Should it still be hyphenated, or can this not be enforced with the Knuth algorithm and text-align="start"? No breakpoints: An exception is thrown, at LineLayoutManager.getNextBreakPoss(LineLayoutManager.java:495). It occurs because breakpoints has size 0; the third call to findBreakingPoints also returned 0. This should not be possible; the algorithm should always return a breakpoint. A few small remarks: Can you move the following log messages to trace log level: [DEBUG] AbstractLayoutManager - - Word to hyphenate: We In TextLM, returning null for a forced LF is not an idea that I like, because it overloads the null return value. Cannot you return an special Knuth element for LF? Alternatively, you could return null and process the paragraph. The second paragraph would then be produced and processed later. InlineStackingLM.getNextKnuthElements: 'if (lc.startsNewArea())' no longer used? Regards, Simon
Created attachment 12492 [details] test fo: Nested inline and other LMs
Created attachment 12493 [details] test fo: Justification
Created attachment 12494 [details] test fo: No breakpoints
I'm going to attach still another version of the patch :-), corrected according to Simon's comments. The new files (Knuth*.java) are unchanged, so I don't attach them. Regards, Luca
Created attachment 12518 [details] patch to existing files (sixth edition)
Luca, The build failed. That is probably due to this line in the patch: cvs server: I know nothing about layoutmgr/LeafNodeLaoyutManager.java Simon
Sorry again, I didn't notice that cvs error. I'm going to attach the right (?) patch :-) (for some strange reasons wincvs' diff shows Simon's latest changes too, I hope this is not a problem ...) Luca
Created attachment 12560 [details] patch to existing files (version 7.1)
I applied the latest cvs commit and re-created the patch using linux's diff. Then I succeeded in applying the patch, compiling and running a few fo tests, so I'm (almost) sure this time it will work! :-) Sorry again for the long delay Luca
Created attachment 12566 [details] patch to existing files (tested)
Luca, Thanks for the new patch. I could apply it without problems, and testing it goes well. You mention that you have not implemented the Knuth algorithm for ContentLM. Would it be difficult to do that? FOP team, If I would apply this patch, we would get the following regressions: - ContentLM does not show its content. A leader with leader-pattern="use-content" results in a blank area of the right size. - When for an exceptionally difficult paragraph no set of breaking points can be found, the whole paragraph is printed on a single line. This occurs, for example, when in a narrow typesetting width only a single word or a part of it fits in a line. I am working towards applying this patch despite these regressions, for these reasons: - This patch is a good piece of work, and a step forward for FOP's layout. - It becomes increasingly hard to maintain this patch outside of CVS. Please, speak up if you are against this. Simon
<Q> I am working towards applying this patch despite these regressions, for these reasons: - This patch is a good piece of work, and a step forward for FOP's layout. - It becomes increasingly hard to maintain this patch outside of CVS. Please, speak up if you are against this. </Q> Simon, I have not had the time to be following this issue much so will be deferring to your judgment. Thanks, Glen
Luca, Patch applied. Thanks for this innovative and extensive contribution. Still to be done: - Resolve the regressions mentioned above. - I support the idea to create an InlineLayoutManager interface, which extends LayoutManager. - This patch has made a lot of existing code redundant. Much of that code is still present. To keep the code clean and intelligible, the redundant pieces should be removed at some time by somebody. I have added a space after casts. See the style guidelines in the file dev/conventions.html of the FOP web site. I have a few remarks about the code. I leave it to you to follow these up or not, but I would like to see point 1 addressed: 1. In TextLM: // linefeed; this can happen when linefeed-treatment="preserve" // the linefeed character is the first one in textArray, // so we can just return a list with a penalty item In LineLM: if (returnedList.size() > 1 || !(thisElement.isPenalty() && ((KnuthPenalty)thisElement).getP() == -KnuthElement.INFINITE)) { } else { // a list with a single penalty item whose value is -inf // represents a preserved linefeed, wich forces a line break Can we be sure that U+A is always alone or the first item in a textArray; does this not depend on the Parser, how it calls the SAX characters method? 2. In InlineStackingLM.applyChanges: Falling over the end of oldListIterator can be done better: treat only currLM != prevLM in the loop, treat !oldListIterator.hasNext() after the loop. Same for getChangedKnuthElements? : while(oldListIterator.hasNext()) { oldElement = (KnuthElement)oldListIterator.next(); currLM = oldElement.getLayoutManager(); // initialize prevLM if (prevLM == null) { prevLM = currLM; } if (currLM != prevLM) { bSomethingChanged = prevLM.applyChanges(oldList.subList(fromIndex, oldListIterator.previousIndex())) || bSomethingChanged; prevLM = currLM; fromIndex = oldListIterator.previousIndex(); } } bSomethingChanged = currLM.applyChanges(oldList.subList(fromIndex, oldList.size())) || bSomethingChanged; Possible cases, after the loop: xxyy or yy, xx done prevLM = currLM = y fromIndex = last done (2 and 0) 3. In InlineStackingLM: Unnecessary differences between treatment of returnedList and returnList in getNextKnuthElements and getChangedKnuthElements. In getChangedKnuthElements it is not necessary to have a separate returnedList and returnList. 4. Break up long methods in LineLM: findHyphenationPoints, getNextBreakPoss, considerLegalBreak (?), findBreakingPoints (?). Regards, Simon
batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed