28431 – [PATCH] Hyphenation of words with punctuation marks

Bug 28431 - [PATCH] Hyphenation of words with punctuation marks

Summary: [PATCH] Hyphenation of words with punctuation marks

Status:	CLOSED FIXED

Alias:	None

Product:	Fop - Now in Jira
Classification:	Unclassified
Component:	page-master/layout (show other bugs)
Version:	trunk
Hardware:	PC Windows XP

Importance:	P3 normal
Target Milestone:	---
Assignee:	fop-dev

URL:
Keywords:

Depends on:
Blocks:

Reported:	2004-04-16 13:29 UTC by Luca Furini
Modified:	2012-04-01 06:25 UTC (History)
CC List:	0 users

Attachments
proposed patch to HyphenationTree (2.12 KB, patch) 2004-04-16 13:29 UTC, Luca Furini	Details \| Diff
test fo file: words with punctuation marks and parenthesis (1.30 KB, text/plain) 2004-04-16 13:30 UTC, Luca Furini	Details
An expanded test fo file (2.13 KB, text/xml) 2004-04-16 19:45 UTC, Simon Pepping	Details
A slightly modified patch (3.89 KB, patch) 2004-04-16 19:46 UTC, Simon Pepping	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Luca Furini 2004-04-16 13:29:05 UTC

I have found a small bug concerning hyphenation in the 
HyphenationTree.hyphenate() method.
Before checking the exception list or using the algorithm, the 
function "normalizes" the word: during this phase, if a non-letter character 
is found null is returned.
    // normalize word
    char[] c = new char[2];
    for (i = 1; i <= len; i++) {
        c[0] = w[offset + i - 1];
        int nc = classmap.find(c, 0);
        if (nc < 0) {    // found a non-letter character, abort
            return null;
        }
        word[i] = (char)nc;
    }
I think the condition (nc < 0) is too strong: at the moment words followed by 
punctuation marks, or in parenthesis, are not hyphenated.
So, for example, the word "suggestion" can be hyphenated, but "suggestion." 
and "(suggestion)," cannot.

This is how I tried to fix this problem:
- non-letter characters at the beginning are not copied into word[]
- if a non-letter character is found which is not at the beginning, it is not 
copied into word[] and a boolean variable becomes true
- if a letter-character is found when the variable is true, null is returned; 
otherwise, word[] is used to find hyphenation points

I have also added a little optimization: if, after the normalization and the 
non-letter character removal, the word size is less than (remainCharCount + 
pushCharCount), null is returned, without checking the exception list and 
performing the algorithm.

I'm going to attach the proposed patch and a test fo file which shows a few 
examples.

Regards

    Luca

Comment 1 Luca Furini 2004-04-16 13:29:53 UTC

Created attachment 11258 [details]
proposed patch to HyphenationTree

Comment 2 Luca Furini 2004-04-16 13:30:16 UTC

Created attachment 11259 [details]
test fo file: words with punctuation marks and parenthesis

Comment 3 Simon Pepping 2004-04-16 19:44:15 UTC

Luca,

The patch works well.

I do not find the name bAfterLetter very clear. It really is
bNonLetterAfterLetters, but that is too long. I find bEndOfLetters a
reasonable choice.

The 'else if (!bAfterLetter)' might as well be just 'else'.

The venom is in the tail. I do not know the details of this part of
hyphenation. Your addition of 'iIgnoreAtBeginning' seems OK. I think
you should also add 'iIgnoreAtBeginning' in the if branch (hyphenation
exceptions), but the results of a test fo are not quite in
favour. Perhaps you can have a look into this.

I added a long comment explaining various features, perhaps most to
myself.

I added cases to the test fo showing a word that is too short (when
one adds debug logging, one sees the effect), and 4 cases with a
hyphenation exception word.

Regards, Simon

Comment 4 Simon Pepping 2004-04-16 19:45:30 UTC

Created attachment 11264 [details]
An expanded test fo file

Comment 5 Simon Pepping 2004-04-16 19:46:39 UTC

Created attachment 11265 [details]
A slightly modified patch

Comment 6 Glen Mazza 2004-04-17 05:17:50 UTC

Your assumptions appear correct, I checked the Washington Post newspaper and saw
that hyphenation does indeed occur with words that have a period or comma at the
end of them.

Glen

Comment 7 Luca Furini 2004-04-19 10:30:35 UTC

Simon,

concernig names, unnecessary "if", etc. , I agree with you.

It seems to me that your change concerning hyphenation exceptions works, 
otherwise the hyphenation points would appear in the wrong place because of 
the punctuation marks.

The strange pdf generated is due, IMO, to a couple of problems:

-1-
In the last test case the text is (quite oddly) divided among 3 TextLM
 "**[...]** (philanthrop"
 "ic)."
 " "
Specifying the property linefeed-treatment="ignore", the text is all in a TLM.
Removing from the test file the linefeed after "(philanthropic).", the text is 
still split in two parts:
 "***************************"
 "*************************************** (philanthropic)."
So, it seems there is an irksome bug affecting text splitting.

-2-
The last line in a justified paragraph is sometimes justified too (bug 28314).
The "phantom linefeed" is by default treated as a space, and so it is adjusted.
Anyway, I was pleased to notice that, although shattered, the word is 
correctly collected and hyphenated.

Regards

    Luca

Comment 8 Simon Pepping 2004-04-26 19:35:42 UTC

Luca,

I agree with item 1. Although the spaces before the hyphenated word
are often incorrect, the word is hyphenated correctly, and the patch
solves the problem it set out to solve. 

I also agree with item 2. The break up of the word is incorrect. Also
the spaces before the hyphenated word are often incorrect. However,
this problem is not caused by the patch, it is just revealed by it.

The patch can be applied. The break-up and spacing problems should be
solved in a separate effort.

Simon

Comment 9 Simon Pepping 2004-05-29 09:20:31 UTC

Patch applied. Thanks.
Simon

Comment 10 Glenn Adams 2012-04-01 06:25:55 UTC

batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed