Issue 68047 - ZWJ: The zero width joiner shouldn't be filtered out
Summary: ZWJ: The zero width joiner shouldn't be filtered out
Status: CLOSED FIXED
Alias: None
Product: gsl
Classification: Code
Component: code (show other issues)
Version: OOo 2.0.3
Hardware: All All
: P3 Trivial (vote)
Target Milestone: OOo 2.1
Assignee: stefan.baltzer
QA Contact: issues@gsl
URL:
Keywords:
Depends on:
Blocks: 71241
  Show dependency tree
 
Reported: 2006-08-01 16:32 UTC by caolanm
Modified: 2007-08-27 10:45 UTC (History)
3 users (show)

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
screenshot (3.06 KB, image/png)
2006-08-01 16:33 UTC, caolanm
no flags Details
test document (7.29 KB, application/vnd.oasis.opendocument.text)
2006-08-01 16:33 UTC, caolanm
no flags Details
patch to fix (678 bytes, patch)
2006-08-01 16:33 UTC, caolanm
no flags Details | Diff
testcase (78 bytes, text/plain)
2006-08-25 00:39 UTC, harshula
no flags Details
Correct rendering of testcase in gedit (5.08 KB, image/png)
2006-08-25 00:41 UTC, harshula
no flags Details
Incorrect rendering of testcase in openoffice (6.22 KB, image/png)
2006-08-25 00:41 UTC, harshula
no flags Details

Note You need to log in before you can comment on or make changes to this issue.
Description caolanm 2006-08-01 16:32:54 UTC
The zero width joiner and other such zero width chars shouldn't be filtered out
by vcl I believe. Attached is a screenshot of what the next attachment should
look like (e.g. like it does in gedit). With the final patch applied, OOo does
the right thing.
Comment 1 caolanm 2006-08-01 16:33:16 UTC
Created attachment 38182 [details]
screenshot
Comment 2 caolanm 2006-08-01 16:33:38 UTC
Created attachment 38183 [details]
test document
Comment 3 caolanm 2006-08-01 16:33:54 UTC
Created attachment 38184 [details]
patch to fix
Comment 5 hdu@apache.org 2006-08-14 10:51:55 UTC
See issue 51400 on why they got filtered out. Unless we want a regression there
I cannot apply the patch, which reverts the fix for that issue.

I'll have to work on this: the layout engines that know how to deal with ZWNJ
should see them, others shouldn't.
Comment 6 harshula 2006-08-25 00:16:22 UTC
Hi,

1) re: "See issue 51400 on why they got filtered out. Unless we want a 
regression there I cannot apply the patch, which reverts the fix for that 
issue."

Actually, the 'fix' for issue 51400 is a regression and should never have been 
applied. As a result of this 'fix', it has broken the rendering of South Asian 
scripts on Open Office.


2) re: "I'll have to work on this: the layout engines that know how to deal 
with ZWNJ should see them, others shouldn't."

It's not just ZWNJ (0x200C), it's also ZWJ (0x200D) which is arguably more 
important.

Codepoints used for Layout Controls need to reach the renderer, hence they 
should not be filtered out by the application. e.g. ZWJ (0x200D) should not be 
filtered out by Open Office because it is required by ICU. Hence I would argue 
that it is the layout engines (renderer) that can't handle the 'Layout Controls' 
that need to be fixed.

I'm not familiar with Open Office source code and due to the very large size of 
its source tree, it's unlikely I'll ever be familiar with the source code. 
Therefore, would you be able to please enumerate the different layout engines 
Open Office has to work with?


3) Read the section "Cursive Connection and Ligatures" p.389-391:
http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf

Furthermore, this unicode proposal requires ZWJ to be available to the renderer:
http://www.unicode.org/review/pr-37.pdf

"This proposal intends to rectify these problems, clarifying how the ZERO WIDTH 
JOINER is to be applied in scripts, and consolidating common mechanisms for 
equivalent problems that exist in several scripts."

which was accepted:
http://www.unicode.org/review/resolved-pri.html

"Resolution: Closed 2004-08-24. UTC accepted the proposal and will create an 
Indic conjoining behavior model."


4) Not only does the ZWJ have to reach the renderer (e.g. Pango or ICU) but it 
is also needed at the font lookup stage:

http://bugzilla.gnome.org/show_bug.cgi?id=161981
http://dev.icu-project.org/cgi-bin/icu-bugs?findid=4710
http://dev.icu-project.org/cgi-bin/icu-bugs?findid=4711

Regards,
Harshula
Comment 7 harshula 2006-08-25 00:39:21 UTC
Created attachment 38758 [details]
testcase
Comment 8 harshula 2006-08-25 00:41:10 UTC
Created attachment 38759 [details]
Correct rendering of testcase in gedit
Comment 9 harshula 2006-08-25 00:41:59 UTC
Created attachment 38760 [details]
Incorrect rendering of testcase in openoffice
Comment 10 harshula 2006-08-25 00:47:06 UTC
Hi,

Here's a Sinhala font that contains the necessary glyphs to render
icu-zwj-handling.txt:

http://sinhala.sourceforge.net/files/lklug.hj.ttf

Regards,
Harshula
Comment 11 sajithvk 2006-10-11 08:26:49 UTC
I am not familiar with OO source code.
It seems that ZWJ and ZWNJ are getting filtered even
on malayalam (ml or mal) language. This has broken
malayalam rendering competely....

It need to be reverted for languages which requires
ZWJ and ZWNJ, atleast for malayalam it is a must
Comment 12 hdu@apache.org 2006-10-13 14:13:07 UTC
The arguments mentioned above are excellent. The control codes are essential to
get the layout engines do their job properly.

The problem is that some engines for some scripts don't understand some of the
control codes. Which combinations of script/engine/engineversion/controlcode
have problems requires a huge testing effort.

@ES/SBA: I think the gains by allowing some of the control codes back outweigh
the small regressions that could be introduced by the problems mentioned above.
The control codes should only appear in scripts where the layout engines know
how to handle them anyway...

In CWS vcl68 the ZWJ/ZWNJ won't get filtered out anymore.
Comment 13 hdu@apache.org 2006-10-19 08:12:57 UTC
@SBA: please verify in CWS vcl68
Testing hints:
- test with "normal documents" on UNX and WIN platforms
- test with test document
- test with multiple CTL scripts where ZW*J (U+200B,U+200D) were inserted
Comment 14 stefan.baltzer 2006-10-25 15:18:08 UTC
SBA: Verified in CWS vcl68.
Comment 15 harshula 2006-11-09 19:29:54 UTC
Hi,

I have my reservations regarding cmc's patch that was applied to fix this bug.
It should fix the ZWJ issue I was seeing with Open Office and Sinhala since
early 2006 (http://mail.lug.lk/lurker/message/20060410.130454.19cefb01.en.html).
However,

1) The reporter of issue 51400 was testing on Linux systems which should have
been using ICU as the renderer. It was indicated there was "No problem with
win32". Therefore, the bug should have been opened for ICU, not openoffice.
Hence, perhaps there's an argument for completely yanking that patch.


2) Maybe there's a better way to approach the IsControlChar() function. At the
moment for a common character, like a letter from the English alphabet, there
are six (6) 'if statements' that have to be completed before returning false.
And for 'control' characters, in the worst case, there are 6 'if statements
before returning true.
 Would it not be better to add a first 'if statement' to check whether the char
is within the common range U+0020 (space) to, let us say, U+2027 (hyphenation
point) and return false immediately? The fast path will ensure that for the
majority of characters analysed by IsControlChar() it will return after one 'if
statement' and for a minority of chracters it will, in the worst case, return
after 7 'if statements'.


3) The patch also stops filtering out U+200B (ZERO WIDTH SPACE) but continues to
filter out U+200E (LEFT-TO-RIGHT MARK) and U+200F (RIGHT-TO-LEFT MARK). I
suspect LRM and RLM are required more than ZWS, if the renderer is allowed to do
the BiDi analysis.

I have discussed this filtering issue with Eric Mader (ICU) and Behdad Esfahbod
(Pango). We all feel that an application, e.g. openoffice, should not be
filtering formating characters that are required by the renderer for the correct
layout of text. BTW, Eric also mentioned that ICU has a ParagraphLayout class in
the Layout Extensions Library which does the BiDi analysis.

Regards,
Harshula
Comment 16 caolanm 2006-11-09 21:02:39 UTC
FWIW, possibly orthogonal to this is that in icu itself there is a "filter zwj"
flag which is active for some scripts, i.e.
icu/source/layout/IndicClassTables.cpp and we build with "system" icu 3.6, and
there recently for Malayam we changed...

-#define MLYM_SCRIPT_FLAGS (SF_MPRE_FIXUP | SF_NO_POST_BASE_LIMIT |
SF_FILTER_ZERO_WIDTH)
+#define MLYM_SCRIPT_FLAGS (SF_MPRE_FIXUP | SF_NO_POST_BASE_LIMIT)

to render malayam correctly, so maybe it's worth noting that icu has a possible
layer of ZWJ filtering
Comment 17 harshula 2006-11-10 14:05:31 UTC
Hi cmc,

Please refer to the 3 URIs, bug reports, provided at the end of my comment on
Thu Aug 24 15:16:22 -0800 2006. 

When I requested ZWJ to be allowed to reach the font lookup stage, the filter ZW
flag was introduced to Pango by Owen and later to ICU by Eric as a conservative
approach. i.e. Don't filter ZWJ for Sinhala but continue, as before, to filter
for Indic scripts.

Please be aware that the SF_FILTER_ZERO_WIDTH flag *only* filters the ZWJ/ZWNJ
from the font lookup stage, not ICU's layout stage. i.e. ZWJ/ZWNJ are available
to IndicReordering::findSyllable().

So in the case of *Malayalam*, are you sure the ZWJ is needed during the gsub
lookup in the font? If you want to discuss this further, feel free to send my an
email.

Regards,
Harshula
Comment 18 harshula 2007-07-03 18:23:00 UTC
Hi,

FYI, ZWJ references in GSUB lookups are now valid for Sinhala, Malayalam and 
Kannada fonts.

Regards,
Harshula
Comment 19 caolanm 2007-08-27 10:45:32 UTC
closed