Bug 28431 - [PATCH] Hyphenation of words with punctuation marks
Summary: [PATCH] Hyphenation of words with punctuation marks
Status: CLOSED FIXED
Alias: None
Product: Fop - Now in Jira
Classification: Unclassified
Component: page-master/layout (show other bugs)
Version: trunk
Hardware: PC Windows XP
: P3 normal
Target Milestone: ---
Assignee: fop-dev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-04-16 13:29 UTC by Luca Furini
Modified: 2012-04-01 06:25 UTC (History)
0 users



Attachments
proposed patch to HyphenationTree (2.12 KB, patch)
2004-04-16 13:29 UTC, Luca Furini
Details | Diff
test fo file: words with punctuation marks and parenthesis (1.30 KB, text/plain)
2004-04-16 13:30 UTC, Luca Furini
Details
An expanded test fo file (2.13 KB, text/xml)
2004-04-16 19:45 UTC, Simon Pepping
Details
A slightly modified patch (3.89 KB, patch)
2004-04-16 19:46 UTC, Simon Pepping
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Luca Furini 2004-04-16 13:29:05 UTC
I have found a small bug concerning hyphenation in the 
HyphenationTree.hyphenate() method.
Before checking the exception list or using the algorithm, the 
function "normalizes" the word: during this phase, if a non-letter character 
is found null is returned.
    // normalize word
    char[] c = new char[2];
    for (i = 1; i <= len; i++) {
        c[0] = w[offset + i - 1];
        int nc = classmap.find(c, 0);
        if (nc < 0) {    // found a non-letter character, abort
            return null;
        }
        word[i] = (char)nc;
    }
I think the condition (nc < 0) is too strong: at the moment words followed by 
punctuation marks, or in parenthesis, are not hyphenated.
So, for example, the word "suggestion" can be hyphenated, but "suggestion." 
and "(suggestion)," cannot.

This is how I tried to fix this problem:
- non-letter characters at the beginning are not copied into word[]
- if a non-letter character is found which is not at the beginning, it is not 
copied into word[] and a boolean variable becomes true
- if a letter-character is found when the variable is true, null is returned; 
otherwise, word[] is used to find hyphenation points

I have also added a little optimization: if, after the normalization and the 
non-letter character removal, the word size is less than (remainCharCount + 
pushCharCount), null is returned, without checking the exception list and 
performing the algorithm.

I'm going to attach the proposed patch and a test fo file which shows a few 
examples.

Regards

    Luca
Comment 1 Luca Furini 2004-04-16 13:29:53 UTC
Created attachment 11258 [details]
proposed patch to HyphenationTree
Comment 2 Luca Furini 2004-04-16 13:30:16 UTC
Created attachment 11259 [details]
test fo file: words with punctuation marks and parenthesis
Comment 3 Simon Pepping 2004-04-16 19:44:15 UTC
Luca,

The patch works well.

I do not find the name bAfterLetter very clear. It really is
bNonLetterAfterLetters, but that is too long. I find bEndOfLetters a
reasonable choice.

The 'else if (!bAfterLetter)' might as well be just 'else'.

The venom is in the tail. I do not know the details of this part of
hyphenation. Your addition of 'iIgnoreAtBeginning' seems OK. I think
you should also add 'iIgnoreAtBeginning' in the if branch (hyphenation
exceptions), but the results of a test fo are not quite in
favour. Perhaps you can have a look into this.

I added a long comment explaining various features, perhaps most to
myself.

I added cases to the test fo showing a word that is too short (when
one adds debug logging, one sees the effect), and 4 cases with a
hyphenation exception word.

Regards, Simon
Comment 4 Simon Pepping 2004-04-16 19:45:30 UTC
Created attachment 11264 [details]
An expanded test fo file
Comment 5 Simon Pepping 2004-04-16 19:46:39 UTC
Created attachment 11265 [details]
A slightly modified patch
Comment 6 Glen Mazza 2004-04-17 05:17:50 UTC
Your assumptions appear correct, I checked the Washington Post newspaper and saw
that hyphenation does indeed occur with words that have a period or comma at the
end of them.

Glen
Comment 7 Luca Furini 2004-04-19 10:30:35 UTC
Simon,

concernig names, unnecessary "if", etc. , I agree with you.

It seems to me that your change concerning hyphenation exceptions works, 
otherwise the hyphenation points would appear in the wrong place because of 
the punctuation marks.

The strange pdf generated is due, IMO, to a couple of problems:

-1-
In the last test case the text is (quite oddly) divided among 3 TextLM
 "**[...]** (philanthrop"
 "ic)."
 " "
Specifying the property linefeed-treatment="ignore", the text is all in a TLM.
Removing from the test file the linefeed after "(philanthropic).", the text is 
still split in two parts:
 "***************************"
 "*************************************** (philanthropic)."
So, it seems there is an irksome bug affecting text splitting.

-2-
The last line in a justified paragraph is sometimes justified too (bug 28314).
The "phantom linefeed" is by default treated as a space, and so it is adjusted.
Anyway, I was pleased to notice that, although shattered, the word is 
correctly collected and hyphenated.

Regards

    Luca
Comment 8 Simon Pepping 2004-04-26 19:35:42 UTC
Luca,

I agree with item 1. Although the spaces before the hyphenated word
are often incorrect, the word is hyphenated correctly, and the patch
solves the problem it set out to solve. 

I also agree with item 2. The break up of the word is incorrect. Also
the spaces before the hyphenated word are often incorrect. However,
this problem is not caused by the patch, it is just revealed by it.

The patch can be applied. The break-up and spacing problems should be
solved in a separate effort.

Simon
Comment 9 Simon Pepping 2004-05-29 09:20:31 UTC
Patch applied. Thanks.
Simon
Comment 10 Glenn Adams 2012-04-01 06:25:55 UTC
batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed