Reported by Gilles Beaugeais on fop-users: At first glance, it seems like this could be a regression due to the UAX#14 linebreaking, but that is a wild guess. I'll investigate the attached sample, and run it through the debugger later, if nobody beats me to it. I've already added the same block without hyphenation, and giving this a quick run, it seems the problem wrt dashes being ignored as possible linebreaks is unrelated to hyphenation settings...
Created attachment 19622: Example FO demonstrating the problem
Yes, the changed behaviour is due to the UAX#14 changes, but as far as I can tell the new behaviour is in line with the UAX#14 spec (http://www.unicode.org/reports/tr14/).

Rule 21 says: LB21 Do not break before hyphen-minus,....
and
Rule 25 prevents a line break between a hyphen and a digit that follows it.

This means the only legal breakpoint in the text in question, 'C-12-188-440/NH-000', is after the forward slash, which is the one FOP chooses.

You could surround the hyphen with ZWSP or use an EM DASH instead of the HYPHEN to generate a line breaking opportunity.

I have changed the bug to INVALID, but feel free to disagree.
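For data that arrives with plain ASCII hyphens, the ZWSP workaround suggested above could be applied in a preprocessing step before the text is written into the FO file. The following is only a minimal sketch, assuming the identifiers are available as Java strings; the class and method names are hypothetical and not part of FOP. It inserts a ZWSP only after each hyphen, on the assumption that a break before the hyphen is not wanted (UAX#14 allows a break after ZWSP).

```java
// Hypothetical helper, not part of FOP: adds a break opportunity after each hyphen.
final class HyphenBreakHelper {
    private static final char ZWSP = '\u200B'; // ZERO WIDTH SPACE

    static String allowBreakAfterHyphens(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(c);
            // Skip a hyphen in last position: a trailing ZWSP would serve no purpose.
            if (c == '-' && i < text.length() - 1) {
                out.append(ZWSP);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String id = allowBreakAfterHyphens("C-12-188-440/NH-000");
        // Make the invisible ZWSP visible as "|" for inspection.
        System.out.println(id.replace(ZWSP, '|')); // prints C-|12-|188-|440/NH-|000
    }
}
```

If the text is edited by hand in the XML source instead, the same character can be written as the character reference `&#x200B;`.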
(In reply to comment #2)
I think you're right, actually: a hyphen character shouldn't be used in such a case. That reminds me of something I saw in a book on typographic rules, namely that the proper character to use here is the en dash (U+2013), as in date ranges (e.g., 2001–2005). I guess a break would be allowed then. Now that UAX#14 is implemented, illegal uses of hyphens will start to stand out. Let's get prepared to teach people about the right use of the several dash characters: hyphen, en dash, em dash, quotation dash, etc. A new era of high-level typography has dawned...
Of course, 'high-level typography' doesn't really help you much if what you need to do is generate invoices, orders, or the like from existing data sources, e.g. databases, in which order numbers or item numbers, for example, are stored in the good old ASCII character set using the hyphen. Let's see whether this breaking behaviour becomes a trouble spot down the track. There is always the option of making this configurable somehow, and the spec explicitly allows that for certain rules.
Thanks for the explanations. I didn't know these rules. (The spec appears a little 'strange' to me: the basic space character is used as a break, whereas the nbsp entity is used to keep words together; but the basic hyphen character is used to keep numbers together, whereas the en dash is used to break. It is very confusing!) This is very annoying for me, as I have thousands of XML files written with hyphens. It is impossible to ask users to insert en dashes instead of hyphens: it is time-consuming, and the en dash is displayed differently from the hyphen (with a font like Arial). So could someone tell me whether it is easy to modify the source code (and where, if possible) to make the hyphen behave the same as the en dash or em dash? Thanks again for your help and your work on FOP.
It's just software, so anything is possible :-) Historically, the hyphen is one of those characters that is vastly overloaded with different meanings in different contexts. The UAX#14 spec has taken one particular approach to its interpretation, which admittedly doesn't match well in some legacy situations, i.e. situations in which the hyphen is used in a context different from the one the Unicode standard expects.

The quick and dirty hack (untested, of course) would be to add something like the following at the beginning of the method nextChar() in org.apache.fop.text.linebreak.LineBreakStatus: if (c == '-') c = '\u2013'; That is, whenever we give a hyphen to the line breaker, convert it into an EN DASH before the line breaker deals with it.

A probably better solution, but one that requires some understanding of the UAX#14 spec, would be to change the actual pair table in src/codegen/unicode/data/LineBreakPairTable.txt and to regenerate the Java code using the codegen-unicode ant target. For example, change the cell (row HY / col NU) from % (indirect break opportunity) to _ (direct break opportunity). That would allow a break between a hyphen and a numeric that directly follows it. Hope this helps.
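To illustrate what changing the HY/NU cell does, here is a toy model of a pair-table lookup. It is only a sketch under stated assumptions: it is not FOP's generated code, it knows just four line breaking classes (AL, NU, HY, SY), it handles only text without spaces, and all class and method names are hypothetical. The flag hyNuDirect stands for the HY/NU cell: false models '%', true models '_'.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a UAX#14-style pair table; not FOP's generated code.
final class PairTableSketch {
    enum LbClass { AL, NU, HY, SY }

    // Classify only the kinds of characters that occur in the example string.
    static LbClass classify(char c) {
        if (c == '-') return LbClass.HY;
        if (c == '/') return LbClass.SY;
        if (Character.isDigit(c)) return LbClass.NU;
        return LbClass.AL;
    }

    // Returns true if a break is allowed between two adjacent characters
    // (no space in between). hyNuDirect models the HY/NU cell of the table.
    static boolean breakAllowed(LbClass before, LbClass after, boolean hyNuDirect) {
        if (after == LbClass.HY || after == LbClass.SY) {
            return false; // never break before a hyphen or a slash
        }
        switch (before) {
            case HY: return after == LbClass.NU ? hyNuDirect : true;
            case SY: return after != LbClass.NU;
            default: return false; // letters and digits stay together
        }
    }

    // Returns the string indices before which a line break may be inserted.
    static List<Integer> breakPositions(String text, boolean hyNuDirect) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            LbClass before = classify(text.charAt(i - 1));
            LbClass after = classify(text.charAt(i));
            if (breakAllowed(before, after, hyNuDirect)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String s = "C-12-188-440/NH-000";
        System.out.println(breakPositions(s, false)); // prints [13]
        System.out.println(breakPositions(s, true));  // prints [2, 5, 9, 13, 16]
    }
}
```

With the cell modelled as '%', the only break opportunity is after the slash (index 13); modelled as '_', a break also becomes possible after every hyphen that is followed by a digit.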
Thanks, I will try your ideas. I prefer the second one too, as the hyphen remains a hyphen instead of being converted to an en dash as in the first one. And the second one applies only in table cells.
No, the second one does not apply only in table cells; it will apply everywhere. In the second case you are modifying the 'line breaking pairs table', not line breaking in tables.
batch transition to closed remaining pre-FOP1.0 resolved bugs