47726 – Line breaking a word in the Thai language.

Status:	NEW

Alias:	None

Product:	Fop - Now in Jira
Classification:	Unclassified
Component:	pdf
Version:	0.94
Hardware:	PC Windows XP

Importance:	P3 normal
Target Milestone:	---
Assignee:	fop-dev

Reported:	2009-08-24 04:44 UTC by Hung S Nguyen
Modified:	2012-04-07 01:51 UTC
CC List:	0 users

Attachments
TextLayoutManager.java (12.55 KB, application/x-zip-compressed) 2009-12-03 01:55 UTC, Hung S Nguyen

Description Hung S Nguyen 2009-08-24 04:44:00 UTC

When exporting a PDF, FOP does not render Thai text correctly. It shows the Thai characters, but it breaks lines in the middle of words. I tried the attributes related to white space, but they did not help. How can I fix this issue?

Example from my FO file:
...
<fo:block font-weight="normal" font-family="Arial MS" line-height="12pt" font-size="12pt" space-before.optimum="8pt" space-after.optimum="8pt" start-indent="1cm" end-indent="1cm">1เป็นส่วนผสมที่ละลายได้ทันที 2และไม่จำเป็นต้องใช้กากไก่ในการเตรียม ซึ่งน้ำเกรวี่จะมีลักษณะเนียน และมีกลิ่นรสของไก่ที่ 5หอมอร่อย 
</fo:block>
...

Thanks
Hung

Comment 1 Manuel Mall 2009-08-24 05:05:01 UTC

I have no idea how word boundaries are determined in the Thai language, but from your snippet it appears that they are not indicated by whitespace. I suggest you try putting a ZWSP (zero width space, &#x200b;) between the Thai characters wherever there is a word boundary.

Comment 2 J.Pietschmann 2009-08-27 12:03:57 UTC

Unicode UAX #14 indicates that proper line breaking for the Thai language
involves morphological analysis in order to determine word boundaries. The
standard considered this too complex and left it to the "higher levels
of processing".

The libthai project (http://linux.thai.net/projects/libthai) produces open source
software for this purpose, written in C/C++, which is used by Mozilla, Gnome
applications and other OSS. Apparently, Java applications aren't as easily
supported, yet.

Comment 3 Peter S. Housel 2009-08-31 10:19:51 UTC

(In reply to comment #2)
> The Unicode UAX#14 indicates that proper line breaking for the Thai language
> involves morphological analysis in order to determine word boundaries. The
> standard considered this as too complex and left it to the "higher levels
> of processing".
> The libthai project (http://linux.thai.net/projects/libthai) produces open
> source software for this purpose, written in C/C++, which is used by Mozilla,
> Gnome applications and other OSS. Apparently, Java applications aren't as
> easily supported, yet.

The com.ibm.icu.text.ThaiBreakIterator class in recent versions of ICU4J can supposedly do this. It makes use of an included dictionary of Thai words in order to locate valid break points.
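
As a rough sketch of that idea (not taken from FOP or from any attachment on this bug), the snippet below uses the JDK's java.text.BreakIterator to insert U+200B at each line-break opportunity the iterator reports. The class name ThaiZwspInserter and the method insertZwsp are hypothetical. Whether Thai words are actually segmented depends on the runtime shipping Thai break data, which is an assumption here; to my knowledge com.ibm.icu.text.BreakIterator in ICU4J exposes the same getLineInstance/setText/first/next calls, so it could be swapped in.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class ThaiZwspInserter {
    private static final char ZWSP = '\u200B';

    /**
     * Returns a copy of the text with U+200B inserted at every interior
     * boundary reported by a Thai-locale line-break iterator.
     * Boundaries that follow an ordinary space also get a ZWSP, which is harmless.
     */
    public static String insertZwsp(String text) {
        BreakIterator breaks = BreakIterator.getLineInstance(Locale.forLanguageTag("th-TH"));
        breaks.setText(text);
        StringBuilder out = new StringBuilder(text.length() + 16);
        int start = breaks.first();
        for (int end = breaks.next(); end != BreakIterator.DONE; start = end, end = breaks.next()) {
            out.append(text, start, end);
            // Do not append a ZWSP after the final segment.
            if (end < text.length()) {
                out.append(ZWSP);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String marked = insertZwsp("เป็นส่วนผสมที่ละลายได้ทันที");
        // Replace the invisible ZWSP with '|' so the boundaries can be inspected.
        System.out.println(marked.replace(ZWSP, '|'));
    }
}
```

The returned string could then be written into the fo:block text, so that the layout engine only needs to honour the zero width spaces.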

Comment 4 Hung S Nguyen 2009-11-17 00:51:08 UTC

I'm sorry, I was busy with other tasks and wasn't able to continue. Now I'm coming back to this issue. I have tried many approaches: I set many whitespace-related attributes and inserted the &#x200b; character between the Thai words, and I even read the code, but I still cannot find any way to break lines as I expect.

Assuming I use ICU4J to put the &#x200b; character between the Thai words correctly, how can we get lines to break as we expect?

Are there attributes that can group words so that lines break only between groups, or that restrict line breaks to whitespace?

Thank you very much

Comment 5 J.Pietschmann 2009-11-17 13:20:18 UTC

(In reply to comment #4)
> Assuming I use ICU4J to put the &#x200b; character between the Thai words
> correctly, how can we get lines to break as we expect?

This should work. AFAICT, Thai letters are mapped to the class AL (ordinary
letter) for line breaking purposes in FOP, which means FOP wouldn't break lines
in Thai text except around the zero width spaces.
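
To illustrate what that would mean in practice, here is a simplified model (not FOP's actual code) in which a run of text can only be broken at zero width spaces. The class ZwspSegments and the method unbreakableRuns are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class ZwspSegments {
    private static final char ZWSP = '\u200B';

    /**
     * Splits the text into the runs between zero width spaces.
     * In this model each returned run is an unbreakable unit, and a
     * line may end only between two consecutive runs.
     */
    public static List<String> unbreakableRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ZWSP) {
                // Skip empty runs caused by leading or repeated ZWSPs.
                if (i > start) {
                    runs.add(text.substring(start, i));
                }
                start = i + 1;
            }
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }
}
```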

Comment 6 Hung S Nguyen 2009-12-03 01:55:47 UTC

Created attachment 24664
TextLayoutManager.java

I don't think that Thai letters are mapped to the class AL. If you debug LineBreakStatus.java --> nextChar() and print currentClass, it is 30 (SA). SA means South East Asian (http://unicode.org/reports/tr14/#SA).

If the class is SA, FOP is able to break the line at any position within a Thai word. In a comment in LineBreakStatus.java, I also see: "* TODO: Better handling for AI, SA, CB and other line break classes.".
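
For comparison, rule LB1 of UAX #14 gives a default for implementations without dictionary support: resolve SA to CM when the character's general category is Mn or Mc, and to AL otherwise. A minimal sketch of that default follows; the class SaResolution, its enum and the method resolveSa are hypothetical and are not part of FOP or of the attached file.

```java
public class SaResolution {
    /** Only the line break classes needed for this illustration. */
    public enum LineBreakClass { AL, CM, SA }

    /**
     * Default LB1 resolution for a code point that has already been
     * classified as SA: combining marks (Mn, Mc) become CM, everything
     * else becomes AL.
     */
    public static LineBreakClass resolveSa(int codePoint) {
        int type = Character.getType(codePoint);
        boolean isMark = type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
        return isMark ? LineBreakClass.CM : LineBreakClass.AL;
    }

    public static void main(String[] args) {
        // U+0E01 THAI CHARACTER KO KAI is a letter; U+0E31 MAI HAN-AKAT is a nonspacing mark.
        System.out.println(resolveSa(0x0E01));
        System.out.println(resolveSa(0x0E31));
    }
}
```

With this resolution, Thai text behaves like a run of ordinary letters, so break opportunities would come only from explicit characters such as U+200B.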

I have now fixed the issue in FOP 0.94 and attached my changed file. Do you agree with my fix? Please let me know what you think.

Thanks
Hung

Comment 7 Glenn Adams 2012-04-07 01:41:59 UTC

resetting P2 open bugs to P3 pending further review