Issue 85360 - Move implementation of ApplyLreOrRleEmbedding() to i18npool ApplyNaturalBidiEmbedding() (was: Wrong BiDi-behavior for composed strings)
Move implementation of ApplyLreOrRleEmbedding() to i18npool ApplyNaturalBidiE...
Status: CONFIRMED
Product: Internationalization
Classification: Code
Component: i18npool
OOo 2.3.1 RC1
All All
: P3 trivial (vote)
: ---
Assigned To: erack
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-01-18 12:45 UTC by djihed
Modified: 2013-08-07 14:59 UTC (History)
7 users (show)

See Also:
Issue Type: TASK
Latest Confirmation on: ---
Developer Difficulty: ---


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description djihed 2008-01-18 12:45:41 UTC
For RTL languages such as Arabic and Hebrew, OpenOffice in general fails to
correctly direct the interface because it doesn't correctly implement the BiDi
algorithm.

http://unicode.org/reports/tr9/

In particular, neutral characters such as the dot "." and parentheses are not
correctly displayed.

In issue #84553. A quick solution was to set the BiDi preferences of the widget
in question. This is more of a hack and needs to be done in numerous places
anywhere text appears. Forcing translators to use RLE or RLM should not be
needed either.

Please have a look at attachments for illustrations:

attachment #50301 [details]
attachment #50302 [details]
attachment #50964 [details]

(I filed this here because I have been referred to the EditEngine which is
responsible for this and which belongs here).
Comment 2 michael.ruess 2008-01-18 14:41:37 UTC
Reassigned to SBA.
Comment 3 hdu@apache.org 2009-01-23 16:20:31 UTC
Adjusted the summary to the root cause of these classes of problems.

Background details (from my entry in issue 78466#desc6):
There is a general problem with code which just composes strings by adding other strings  together. 
There are probably many parts of OOo where such compositions have just been designed and tested for 
LTR-strings.

Before the individual strings get merged into the bigger composition they each look good alone, 
because they seem to assume "natural layout" (the first strong character defines the default direction).

@sb: I suggest to add a String helper method which adds LRE/RLE...PDF (U+202A/U+202B...U+202C) 
pairs to strings that assume natural layout. This would help owners of string compositing code to keep 
the fixes for the problems simple, unless the composition itself requires reordering of parts.
Comment 4 Stephan Bergmann 2009-01-26 12:49:50 UTC
@er:  If I understand hdu correctly, he intends to address the following problem
in the following way.  The problem is as follows:  Consider the composition of
the two strings

U+05D0 HEBREW LETTER ALEF

and

U+0061 LATIN SMALL LETTER A
U+0028 LEFT PARENTHESIS
U+0062 LATIN SMALL LETTER B
U+0029 RIGHT PARENTHESIS

The bidirectional algorithm would erroneously render this like "(a(bA" (where
"A" represents alef) rather than like "a(b)A" (i.e., it would switch back to
right-to-left for the closing parenthesis after the embedded left-to-right run).
 This could be fixed by embedding the complete second string in

U+202A LEFT-TO-RIGHT EMBEDDING
...
U+202C POP DIRECTIONAL FORMATTING

Whether a given string intended for such composition would need to be wrapped in
LRE...PDF or RLE...PDF could be derived from the string content, e.g., by
looking at the directionality of the first strong character in the string.  So,
hdu proposes to add a function to the sal strings that, given a string, returns
the string embedded in either LRE...PDF or RLE...PDF, depending on the string's
contents.

Now, I would argue that such a function should not be added to the sal strings
directly (as it offers functionality that is above the rudimentary semantic
level at which the sal strings reside, and would need semantic character
data---to determine the directionality of the first strong character in a
string, see above---that is currently not available in sal), but could instead
be grouped with other I18N functionality.
Comment 5 alan 2009-01-26 13:02:22 UTC
@hdu:
an alternative to LRE/PDF (which is two characters) could be to a single LRM at
the end of the combined string, as described in issue 18024 
Comment 6 alan 2009-01-26 13:04:42 UTC
@hdu (with corrected typo):

an alternative to LRE/PDF (which is two characters) could be to add a single LRM
at the end of the combined string, as described in issue 18024 
Comment 7 hdu@apache.org 2009-01-26 13:41:11 UTC
Whether to do it with LRE/RLE..PDF or by adding a LRM/RLM marker at the end, is fine all with me. What is 
important is that the programmers doing string composition have a good chance to get it right. It must be 
as easy as possible by using some automatic string helpers. Because if these coders get it wrong all the 
resulting bugs usually hit one poor guy (==me) first, as seen by the countless related bugs.
Comment 8 hdu@apache.org 2009-01-26 13:41:14 UTC
Whether to do it with LRE/RLE..PDF or by adding a LRM/RLM marker at the end, is all fine with me. What is 
important is that the programmers doing string composition have a good chance to get it right. It must be 
as easy as possible by using some automatic string helpers. Because if these coders get it wrong all the 
resulting bugs usually hit one poor guy (==me) first, as seen by the countless related bugs.
Comment 9 ooo 2009-01-27 15:35:19 UTC
Using a LRE/RLE..PDF pair probably would be better in case someone wanted to
combine the result with yet another string again.

See also: issue 78466, and better samples in issue 32179:
http://www.openoffice.org/nonav/issues/showattachment.cgi/16816/OOo_parenthesis_bug.png
http://www.openoffice.org/nonav/issues/showattachment.cgi/16819/parens-ar.png
http://www.openoffice.org/nonav/issues/showattachment.cgi/16868/wrong_next_previous_button.png

I can provide some method in i18npool respectively unotools/i18n, but of course
that would not fix the various instances the issue incarnates. Once the
interface will be provided, applications will actually have do to use it.
Comment 10 thomas.lange 2009-01-28 08:07:53 UTC
TL->ER: Note: svtools/langtab.hxx (CWS vcl99) has now a function
ApplyLtrOrRtlEmphasis that adds the emphasis characters to the string if there
are none yet and the string does not consist of neutral chars only.
(HDU said to look just for the first not neutral char for this.)
Comment 11 hdu@apache.org 2009-01-28 09:05:29 UTC
> ApplyLtrOrRtlEmphasis

It works fine, great!
Once that method moves to a more general place in the i18n framework I suggest to rename it. E.g. to 
ApplyNaturalBidiEmbedding(). Also adding the embedding markers when the string doesn't contain 
neutral characters can be avoided as they aren't needed then.
Comment 12 thomas.lange 2009-01-28 10:49:18 UTC
TL->ER: Slight change for the mentioned function: it is now called
ApplyLreOrRleEmbedding.
Comment 13 thomas.lange 2009-01-28 10:51:30 UTC
.
Comment 14 erack 2009-03-23 22:13:20 UTC
Reassigning to spare time account.
Comment 15 erack 2009-08-19 23:09:42 UTC
Looking into moving the implementation of ApplyLreOrRleEmbedding() down to
i18npool I encountered some obstacle. The implementation takes advantage of
SvtSysLocale to obtain a cached character classification at almost no cost. That
is not available in i18npool, and we certainly don't want to create new
instances of XCharacterClassification for each string inspection, so would need
some cached instance there as well and refactor some coding details. Nothing
hard to implement, just doesn't fit into my schedule for OOo3.2, therefor
retargeting to OOo3.3. Anyway, according to 'hdu' the original bug is fixed with
ApplyLreOrRleEmbedding(), so I'm setting the issue type to TASK.

However, best would even be if SvtSysLocale could pass the cached instances down
to i18npool instead of having to create another instance there.
Comment 16 hdu@apache.org 2009-08-20 07:08:50 UTC
@erack: the idea behind this method is just to find the first codepoint in the string with "strong" bidi-
direction properties. Since icu is already available in i18npool and its u_charDirection() method is cheap 
there is probably not much need for fancy caching directly or indirectly (by tunneling the cache of all 
character properties cache from other modules).
Comment 17 hdu@apache.org 2009-08-20 07:48:43 UTC
Splitting the method getDefaultDirection() off is probably a good idea too.
Comment 18 erack 2009-08-22 20:41:13 UTC
@hdu: A rewrite to directly use ICU sounds indeed most promising. Thanks for the
thought.
Comment 19 erack 2010-02-24 19:53:06 UTC
Fruit flies like banana.. retargeting to OOo3.x, should actually be OOo3.4 but
that target doesn't exist yet.
Comment 20 hdu@apache.org 2010-06-30 09:53:05 UTC
Status update: When ICU's http://bugs.icu-project.org/trac/ticket/7772 ("Fast string direction detection") 
is resolved the method getDefaultDirection() could be implemented more directly using that new ICU call, 
if OOo's ICU baseline had it available.