Before committing the work I did on Arabic to the trunk, the Apache FOP organization seems to want five things:
1) Modify the ICU4J change to check whether the ICU4J classes are available and, if they are not, skip the calls to them
2) Provide the Apache organization with performance data to assess the cost of the Arabic shaping classes
3) Provide the Apache organization with better examples of Arabic usage
4) Move Arabic form shaping and the BIDI algorithm into the layout manager
5) Do not use ICU4J for the Unicode transformation; use the standard ligature mechanism to provide contextual glyphs instead. This amounts to a request for a complete rewrite of the patch around a mechanism I don't currently know, though I could learn it given the right pointers.
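For (1), here is a minimal sketch of what I have in mind, assuming reflection is an acceptable way to keep ICU4J optional. `OptionalArabicShaper` and its methods are hypothetical names, not existing FOP code. `com.ibm.icu.text.ArabicShaping`, its `LETTERS_SHAPE` option and its `shape(String)` method are the ICU4J API as I understand it. The range check covers only the basic Arabic block (U+0600 to U+06FF), which is a simplification.

```java
import java.lang.reflect.Method;

/** Hypothetical helper, not part of FOP: calls ICU4J only if it is on the classpath. */
final class OptionalArabicShaper {
    // ICU4J class name, resolved reflectively so this compiles without ICU4J.
    private static final String SHAPER_CLASS = "com.ibm.icu.text.ArabicShaping";
    private static final Object SHAPER;
    private static final Method SHAPE_METHOD;

    static {
        Object shaper = null;
        Method method = null;
        try {
            Class<?> cls = Class.forName(SHAPER_CLASS);
            // Read the option constant reflectively instead of hard-coding its value.
            int options = cls.getField("LETTERS_SHAPE").getInt(null);
            shaper = cls.getConstructor(int.class).newInstance(options);
            method = cls.getMethod("shape", String.class);
        } catch (ReflectiveOperationException | LinkageError e) {
            // ICU4J is missing or incompatible: fall back to "no shaping".
            shaper = null;
            method = null;
        }
        SHAPER = shaper;
        SHAPE_METHOD = method;
    }

    static boolean isAvailable() {
        return SHAPER != null && SHAPE_METHOD != null;
    }

    /** Cheap pre-check so non-Arabic text never reaches the shaper. */
    static boolean containsArabic(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x0600 && c <= 0x06FF) {
                return true;
            }
        }
        return false;
    }

    /** Returns shaped text when possible, otherwise the input unchanged. */
    static String shape(String text) {
        if (!isAvailable() || !containsArabic(text)) {
            return text;
        }
        try {
            return (String) SHAPE_METHOD.invoke(SHAPER, text);
        } catch (ReflectiveOperationException e) {
            return text; // shaping failed: keep the original text
        }
    }

    public static void main(String[] args) {
        System.out.println("ICU4J available: " + isAvailable());
        System.out.println(shape("Hello"));
    }
}
```

With this shape of guard, the painters would call one method and never reference ICU4J types directly, so FOP would still load when the jar is absent.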
(4) is highly non-trivial. I haven’t a clue as to how to do (5).
For (4), could you point me at the source files in the layout manager that would have to change? Can you give me some pointers as to where this sort of information is processed by the layout manager? I've read the layout manager code and tried to locate where it processes character widths and what would have to change to support right-to-left printing, but I've been unable to see the forest for the trees. I have read Knuth's line-breaking algorithm and I think I have a good understanding of what a KnuthElement is (glue, penalties and the basics of Knuth's algorithm), but I'm having trouble turning this theoretical understanding into a practical understanding of what has to change in the code to make the printing run from right to left.
I’m not sure what to do about (5). Do you have any references or pointers to an algorithm that goes beyond Unicode transformation and selects contextual glyphs based on what a font provides? How do I tell from a font whether a given Arabic glyph is the initial, medial or final form? I suppose this information varies from font to font. Where in FOP is font information like this processed, and how do I “tell” a font that I want the Arabic character at Unicode position X, but in its final form? Does the layout manager actually process the font information about a character? I suppose it must, in order to know the character widths that Knuth's algorithm needs, but please forgive me, I don't see where this code lives. FOP has over 11,000 files!
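To make concrete what I mean by right-to-left ordering with embedded numerals, here is a minimal sketch that uses the JDK's own `java.text.Bidi` rather than ICU4J or anything in FOP. `BidiRunsSketch` and its methods are hypothetical names. It only resolves embedding levels for a logical string; it says nothing about where such a step would sit relative to line breaking in the layout manager, which is exactly the part I don't know.

```java
import java.text.Bidi;

/** Hypothetical illustration, not FOP code: resolves bidi runs for one string. */
final class BidiRunsSketch {

    /** Embedding level of the character at the given index (odd = right-to-left). */
    static int levelAt(String logical, int index) {
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT);
        return bidi.getLevelAt(index);
    }

    /** One line per directional run: start, limit and resolved level. */
    static String describeRuns(String logical) {
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            sb.append('[').append(bidi.getRunStart(i))
              .append(',').append(bidi.getRunLimit(i))
              .append(") level ").append(bidi.getRunLevel(i))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Five Arabic letters, a space, then European digits.
        String text = "\u0645\u0631\u062D\u0628\u0627 123";
        System.out.print(describeRuns(text));
    }
}
```

The Arabic letters resolve to an odd level and the digits to an even one, which is why the numerals read left to right inside right-to-left text.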
I used ICU4J to avoid having to write a ton of code. That is why my patch is so small.
I'm not complaining. I'm hoping I can get some more pointers to what changes need to be made to support Arabic and where the changes have to go. Even if I'm not the one who eventually does the work, whoever eventually implements right-to-left printing and Arabic support will certainly find our discussion valuable. I'm sure you'll agree that FOP needs to become truly international at some point. That would really open a new community of users to the benefits of FOP, which are considerable.
In fact, I agree that it is hard to see how there can be a robust solution to printing Arabic text that involves only the PDF renderer; in theory, and probably in practice, the layout manager has to be involved.
I’ve looked at the FOP SVG rendering code that tries to do Arabic form shaping, and it seems to do only Unicode transformations. It doesn’t seem to use the font capability you describe, where a single Unicode code point can be displayed in several different forms. It seems to do a simple table lookup that maps one Unicode code point to another. So FOP already has code, in the SVG renderer, that seems to do the same thing I tried to do with ICU4J. This doesn't mean the code I wrote using ICU4J is doing the right thing, but it does suggest that transforming one Unicode code point to another is the simplest first step in solving this difficult problem.
Could we agree that we could live with an ICU4J approach if conditions (1), (2), (3) and (4) were met, and that (5), a complete rewrite using modern font techniques, could be deferred? Of course, I'm interested in learning how I could achieve (5). I'm not dismissing (5); I'm just looking for a bottom line that would allow FOP to meet the practical need of rendering Arabic text, even if the result isn't perfect yet.
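By "simple table lookup" I mean something like the sketch below. It is not the SVG renderer's code and not ICU4J's; `PresentationFormTable` and `Position` are hypothetical names, and the table holds only two letters (BEH and TEH) with their Arabic Presentation Forms-B code points. Choosing the position from the neighbouring letters is left out entirely.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of code-point-to-code-point shaping by table lookup. */
final class PresentationFormTable {

    /** Column order used in the table rows below. */
    enum Position { ISOLATED, FINAL, INITIAL, MEDIAL }

    private static final Map<Character, char[]> FORMS = new HashMap<>();
    static {
        // BEH (U+0628): isolated, final, initial, medial presentation forms
        FORMS.put('\u0628', new char[] {'\uFE8F', '\uFE90', '\uFE91', '\uFE92'});
        // TEH (U+062A): isolated, final, initial, medial presentation forms
        FORMS.put('\u062A', new char[] {'\uFE95', '\uFE96', '\uFE97', '\uFE98'});
    }

    /** Maps a base letter to a presentation form; unknown characters pass through. */
    static char lookup(char base, Position position) {
        char[] row = FORMS.get(base);
        return row == null ? base : row[position.ordinal()];
    }

    /** Reverse direction: compatibility normalization recovers the base letter. */
    static char toBase(char presentationForm) {
        String s = Normalizer.normalize(String.valueOf(presentationForm),
                                        Normalizer.Form.NFKC);
        return s.charAt(0);
    }

    public static void main(String[] args) {
        char initialBeh = lookup('\u0628', Position.INITIAL);
        System.out.println(Integer.toHexString(initialBeh));
        System.out.println(Integer.toHexString(toBase(initialBeh)));
    }
}
```

The limitation is visible in the table itself: the output is another Unicode code point, so it only works if the font has glyphs at those presentation-form positions.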
(In reply to comment #11)
> Hi Jonathan,
> (In reply to comment #8)
> > Hi Vincent,
> > I will attach the .fo file I've been using for testing. I will also attach the
> > generated pdf. This is from an example our Dubai team gave me for my own
> > testing as I developed the code.
> Well... It's a bit light for an example. Just a single word...
> > Our Dubai team has been testing with a large variety of Arabic script - but
> > they are using a report creation tool that invokes fop.bat with xsl input so
> > the .fo file isn't part of their output.
> > I could give them instructions for creating .fo files.
> > We have found in testing that what is most important is the BIDI algorithm is
> > applied so that text (including embedded numerals) is in the right order and
> > that form shaping is correct. You need to know the Arabic alphabet and its
> > rules to assess the output of testing. We have a team that knows Arabic to do
> > our testing. They "eyeball" the reports to make sure they are in proper Arabic
> > with text and sub-text in the right order. Embedded numerals can be in a
> > different order - left-to-right rather than right-to-left. It isn't clear to me
> > how this process can be automated.
> > You are right that widths change and this could change line breaking decisions.
> > Do you know where in the FOP pipeline before we reach the rendering pipeline
> > the Arabic shaping could go so as to be able to affect width selection?
> Something needs to be done in the layout engine, possibly also on the FO tree.
> At least section 5.8 (“Unicode BIDI Processing”) of XSL-FO 1.1 deserves a look
> as it explains how the Unicode algorithm should be blended in XSL-FO
> processing. Inline-level stuff is likely to be affected. It needs to be seen
> how and when character re-ordering should be done WRT line breaking.
> Also, something might need to be done at the font level. I don't know what
> ICU4J does, but I suspect it replaces characters from the Arabic range
> (U+0600–U+06FF) with ones from Arabic Presentation Forms-A (U+FB50–U+FDFF).
> AFAIU from the Unicode specification this is legacy that may not be supported
> by every font. I suppose modern fonts (especially OpenType ones) use the
> standard ligature mechanism to provide contextual glyphs.
> > I believe that what ensures the right glyphs are embedded in the PDF file is
> > the nature of the ICU4J algorithm which transforms the UNICODE representation
> > of the string. The output for our Dubai team is PDFs with embedded fonts and
> > these are working so ICU4J must have solved the problem in some way, and I
> > believe the way they solve it is by using different UNICODE codes.
> Actually this is taken care of by the font library called by PDFPainter. I
> suspect the same is done at the layout stage, with the standalone glyphs. Which
> would be suboptimal, as both standalone and contextual glyphs would be embedded
> in the final PDF.
> > I don't have performance numbers to give you yet. If ICU4J was clever about
> > the way they wrote their transform algorithm it should not be much of a
> > performance impact since they only need to transform text in the Arabic UNICODE
> > code range and testing whether text is in this range should be quick.
> > Thanks,
> > Jonathan
> > (In reply to comment #7)
> > > Hi,
> > > Thanks for your patch. Do you have an example FO file that could be used for
> > > testing purpose (even better, with an English translation)?
> > > IIUC, Arabic shaping is about replacing glyphs for standalone letters with
> > > suitable ligature glyphs for building words. Surely that affects character
> > > widths, so line breaking decisions? In the patch, shaping is performed at the
> > > rendering stage, so isn't there a danger of getting inconsistent results?
> > > Also, IIC Arabic shaping affects glyphs selection. How do you make sure that
> > > the right glyphs are being embedded in the PDF file?
> > > The same piece of code is duplicated in the PCL and PDF painters. The same
> > > would probably also need to be done for other painters. This is not desirable.
> > > Finally, what is the impact on performance? It looks like shaping will be
> > > applied to just any text, even non-arabic one.
> > > Thanks,
> > > Vincent
> > > (In reply to comment #3)
> > > > Created an attachment (id=24934)
> > > > Support for Arabic PDF rendering using ICU4J
> > > >
> > > > This patch uses ICU4J to do form-shaping and BIDI transformation of rendered
> > > > text. It is a patch for the FOP trunk. It does not change the layout manager
> > > > or the area tree handler or allow a writing-mode other than “lr-tb”. For this
> > > > patch to be integrated with FOP, FOP would need to distribute the ICU4J library
> > > > - icu4j-4_2_1.jar. It affects both PDF and PCL rendering but has only been
> > > > tested with PDF rendering. So far results of testing with PDF rendering have
> > > > been positive. The PCL aspect of the patch looks correct given that the PDF
> > > > aspect works.