Harmony
  1. Harmony
  2. HARMONY-6649

String.toLowerCase/toUpperCase incorrect for supplementary characters

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 5.0M15
    • Fix Version/s: 5.0M15
    • Component/s: Classlib
    • Labels:
      None

      Description

      Simple testcase:

          assertEquals("\uD801\uDC44", "\uD801\uDC1C".toLowerCase());
      

      Looking at modules/luni/src/main/java/java/lang/String.java, the problem is these methods iterate code units (char) not codepoints (int),
      and use Character.toLowerCase(char) and Character.toUpperCase(char), instead of Character.toLowerCase(int), and Character.toUpperCase(int)

      1. harmony6649.patch
        11 kB
        Tim Ellison
      2. HARMONY-6649_tests.patch
        1 kB
        Robert Muir
      3. HARMONY-6649_filepermission.patch
        0.9 kB
        Robert Muir

        Activity

        Hide
        Tim Ellison added a comment -

        Patch to String was re-applied at repo revision r1000700 (after file permission changes).

        Show
        Tim Ellison added a comment - Patch to String was re-applied at repo revision r1000700 (after file permission changes).
        Hide
        Hudson added a comment -

        Integrated in Harmony-1.5-head-linux-x86_64 #966 (See https://hudson.apache.org/hudson/job/Harmony-1.5-head-linux-x86_64/966/)
        Re-apply fix for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)

        Show
        Hudson added a comment - Integrated in Harmony-1.5-head-linux-x86_64 #966 (See https://hudson.apache.org/hudson/job/Harmony-1.5-head-linux-x86_64/966/ ) Re-apply fix for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)
        Hide
        Hudson added a comment -

        Integrated in Harmony-select-1.5-head-linux-x86_64 #120 (See https://hudson.apache.org/hudson/job/Harmony-select-1.5-head-linux-x86_64/120/)
        Re-apply fix for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)

        Show
        Hudson added a comment - Integrated in Harmony-select-1.5-head-linux-x86_64 #120 (See https://hudson.apache.org/hudson/job/Harmony-select-1.5-head-linux-x86_64/120/ ) Re-apply fix for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)
        Hide
        Hudson added a comment -

        Integrated in Harmony-select-1.5-head-linux-x86_64 #119 (See https://hudson.apache.org/hudson/job/Harmony-select-1.5-head-linux-x86_64/119/)
        Backing out changes for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)
        to get the build working again.

        Show
        Hudson added a comment - Integrated in Harmony-select-1.5-head-linux-x86_64 #119 (See https://hudson.apache.org/hudson/job/Harmony-select-1.5-head-linux-x86_64/119/ ) Backing out changes for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters) to get the build working again.
        Hide
        Hudson added a comment -

        Integrated in Harmony-1.5-head-linux-x86_64 #964 (See https://hudson.apache.org/hudson/job/Harmony-1.5-head-linux-x86_64/964/)
        Backing out changes for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)
        to get the build working again.

        Show
        Hudson added a comment - Integrated in Harmony-1.5-head-linux-x86_64 #964 (See https://hudson.apache.org/hudson/job/Harmony-1.5-head-linux-x86_64/964/ ) Backing out changes for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters) to get the build working again.
        Hide
        Tim Ellison added a comment -

        I'll back out the changes to String until we find all the places that need to be fixed in the bootsequence.

        Show
        Tim Ellison added a comment - I'll back out the changes to String until we find all the places that need to be fixed in the bootsequence.
        Hide
        Tim Ellison added a comment -

        Let's move this conversation to the dev list.

        Show
        Tim Ellison added a comment - Let's move this conversation to the dev list.
        Hide
        Robert Muir added a comment -

        I thought about this too,

        one concern (not knowing if there are more cases involved) would be if the input "should" be ascii, but "could" be something else.
        if String.toLowerCase had the ascii special-case with a fallback to ICU, it could fail less gracefully in such a situation if it encountered non-ascii rather than simply not matching, especially since unit tests tend to have more coverage for the ascii case...

        ...but this might be theoretical

        Show
        Robert Muir added a comment - I thought about this too, one concern (not knowing if there are more cases involved) would be if the input "should" be ascii, but "could" be something else. if String.toLowerCase had the ascii special-case with a fallback to ICU, it could fail less gracefully in such a situation if it encountered non-ascii rather than simply not matching, especially since unit tests tend to have more coverage for the ascii case... ...but this might be theoretical
        Hide
        Tim Ellison added a comment -

        So there are other places in the class library code that invoke toUpperCase/toLowerCase - so I'm back to wondering if we should be special casing the ascii version in String, and leaving ICU for the more complex cases.

        Show
        Tim Ellison added a comment - So there are other places in the class library code that invoke toUpperCase/toLowerCase - so I'm back to wondering if we should be special casing the ascii version in String, and leaving ICU for the more complex cases.
        Hide
        Robert Muir added a comment -

        Ah Tim, you are right... here's a patch that clears it up for me.

        Show
        Robert Muir added a comment - Ah Tim, you are right... here's a patch that clears it up for me.
        Hide
        Tim Ellison added a comment -

        You snipped the javadoc a bit too selectively.

        it says:
        " * Returns the string representing this permission's actions. It must be of

        • the form "read,write,execute,delete", all lower case and in the correct
        • order if there is more than one action.
          "

        i.e.that's the characteristic of the return string, not the input string which does need to go toLowerCase

        But this is going to be ascii.

        Show
        Tim Ellison added a comment - You snipped the javadoc a bit too selectively. it says: " * Returns the string representing this permission's actions. It must be of the form "read,write,execute,delete", all lower case and in the correct order if there is more than one action. " i.e.that's the characteristic of the return string, not the input string which does need to go toLowerCase But this is going to be ascii.
        Hide
        Robert Muir added a comment -

        I was looking at this failure myself:

        Why is java.io.FilePermission.toCanonicalActionString calling String.toLowerCase at all?

        it says in its javadocs:
        It must be of the form "read,write,execute,delete", all lower case and in the correct order if there is more than one action.

        Show
        Robert Muir added a comment - I was looking at this failure myself: Why is java.io.FilePermission.toCanonicalActionString calling String.toLowerCase at all? it says in its javadocs: It must be of the form "read,write,execute,delete", all lower case and in the correct order if there is more than one action.
        Hide
        Mark Hindess added a comment -
        Show
        Mark Hindess added a comment - This looks like it has broken the hudson builds? https://hudson.apache.org/hudson/view/Harmony/job/Harmony-1.5-head-linux-x86_64-full-tests
        Hide
        Hudson added a comment -

        Integrated in Harmony-select-1.5-head-linux-x86_64 #115 (See https://hudson.apache.org/hudson/job/Harmony-select-1.5-head-linux-x86_64/115/)
        Apply regression test for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)

        Show
        Hudson added a comment - Integrated in Harmony-select-1.5-head-linux-x86_64 #115 (See https://hudson.apache.org/hudson/job/Harmony-select-1.5-head-linux-x86_64/115/ ) Apply regression test for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)
        Hide
        Hudson added a comment -

        Integrated in Harmony-1.5-head-linux-x86_64 #956 (See https://hudson.apache.org/hudson/job/Harmony-1.5-head-linux-x86_64/956/)
        Apply regression test for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)

        Show
        Hudson added a comment - Integrated in Harmony-1.5-head-linux-x86_64 #956 (See https://hudson.apache.org/hudson/job/Harmony-1.5-head-linux-x86_64/956/ ) Apply regression test for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)
        Hide
        Robert Muir added a comment -

        I tested this, additionally the lucene test case that was failing now passes.

        in fact: all of core lucene now passes with harmony!

        as for the equalsIgnoreCase, i looked at the javadocs to see what the behavior should be.
        According to the javadocs, it seems the existing harmony behavior is correct,
        because they explicitly say that equalsIgnoreCase should use toLowerCase(char) and
        toUpperCase(char)

        Show
        Robert Muir added a comment - I tested this, additionally the lucene test case that was failing now passes. in fact: all of core lucene now passes with harmony! as for the equalsIgnoreCase, i looked at the javadocs to see what the behavior should be. According to the javadocs, it seems the existing harmony behavior is correct, because they explicitly say that equalsIgnoreCase should use toLowerCase(char) and toUpperCase(char)
        Hide
        Tim Ellison added a comment -

        I applied you regression test in LUNI module at repo revision r999591.

        Please verify by closing this issue that it was applied as you expected. Let's open a new issue for the equalsIgnoreCase.

        Show
        Tim Ellison added a comment - I applied you regression test in LUNI module at repo revision r999591. Please verify by closing this issue that it was applied as you expected. Let's open a new issue for the equalsIgnoreCase.
        Hide
        Robert Muir added a comment -

        here is a proposed patch to the tests:
        1. the test for the greek sigma had a bug, it never actually tested sigma in word-final position, only tested it 'isolated'.
        2. I added assertions that test supplementary chars for lowercase and uppercase.

        Show
        Robert Muir added a comment - here is a proposed patch to the tests: 1. the test for the greek sigma had a bug, it never actually tested sigma in word-final position, only tested it 'isolated'. 2. I added assertions that test supplementary chars for lowercase and uppercase.
        Hide
        Tim Ellison added a comment -

        Yes, but I don't want to forget the regression testcase for this commit.
        FYI I'm not going to be around much over the next week, so apologies if progress slows for a while

        Show
        Tim Ellison added a comment - Yes, but I don't want to forget the regression testcase for this commit. FYI I'm not going to be around much over the next week, so apologies if progress slows for a while
        Hide
        Robert Muir added a comment -

        Tim, the fix looks good to me. can we fix equalsIgnoreCase here too?

        Show
        Robert Muir added a comment - Tim, the fix looks good to me. can we fix equalsIgnoreCase here too?
        Hide
        Hudson added a comment -

        Integrated in Harmony-1.5-head-linux-x86_64 #949 (See https://hudson.apache.org/hudson/job/Harmony-1.5-head-linux-x86_64/949/)
        Apply fix for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)

        Show
        Hudson added a comment - Integrated in Harmony-1.5-head-linux-x86_64 #949 (See https://hudson.apache.org/hudson/job/Harmony-1.5-head-linux-x86_64/949/ ) Apply fix for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)
        Hide
        Hudson added a comment -

        Integrated in Harmony-select-1.5-head-linux-x86_64 #110 (See https://hudson.apache.org/hudson/job/Harmony-select-1.5-head-linux-x86_64/110/)
        Apply fix for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)

        Show
        Hudson added a comment - Integrated in Harmony-select-1.5-head-linux-x86_64 #110 (See https://hudson.apache.org/hudson/job/Harmony-select-1.5-head-linux-x86_64/110/ ) Apply fix for HARMONY-6649 (String.toLowerCase/toUpperCase incorrect for supplementary characters)
        Hide
        Tim Ellison added a comment -

        An improved fix has gone into the LUNI and nio_char modules at repo revision r998030. I'll leave this issue open until I have got some unit tests ready to include, but if you have a chance to look over this fix please send comments.

        Show
        Tim Ellison added a comment - An improved fix has gone into the LUNI and nio_char modules at repo revision r998030. I'll leave this issue open until I have got some unit tests ready to include, but if you have a chance to look over this fix please send comments.
        Hide
        Tim Ellison added a comment -

        Attaching file harmony6649.patch which is a prototype of the changes required for discussion.

        It needs more more work because on the RI

        String foo = "foo";
        foo.toLowerCase() == foo

        but it doesn't on this prototype, courtesy of ICU...

        Also needs work on equalsIgnoreCase I expect.

        Show
        Tim Ellison added a comment - Attaching file harmony6649.patch which is a prototype of the changes required for discussion. It needs more more work because on the RI String foo = "foo"; foo.toLowerCase() == foo but it doesn't on this prototype, courtesy of ICU... Also needs work on equalsIgnoreCase I expect.
        Hide
        Tim Ellison added a comment -

        Moving the discussion to the dev list
        http://markmail.org/message/cfmxbaacmoosripe

        Show
        Tim Ellison added a comment - Moving the discussion to the dev list http://markmail.org/message/cfmxbaacmoosripe
        Hide
        Robert Muir added a comment -

        Does this mean harmony might need these methods for its own internal use before ICU is available?

        When doing the toLowerCase(Locale)/toUpperCase(Locale), perhaps String.java could do:

        if (locale is not Turkish or Azeri or Lithuanian)
        while (ch < 0x7f)
        ( just do optimized fast subtraction/addition )
        ...
        // bail out completely and invoke 'UCharacter.xxx'

        this might be good for performance reasons? And harmony itself, if it uses this method at this point, is likely using Locale.ENGLISH or similar for consistent behavior (filenames, etc) ? Sorry I'm not too familiar with the codebase so I'm not sure if it would work. But it might speed up 'typical' lowercasing in any case, and as far as worst-case 2x for the "special casing": i find the "special" casing is going to be slow anyway: e.g. the Greek sigma example needs to calculate word boundaries!

        the Turkish/Azeri case is trickier than the existing code, I think it should use UCharacter.XXX too.
        The reason is it has to be able to handle the case from SpecialCasing where the 'dotted I' is written in decomposed form (e.g. I + COMBINING DOT ABOVE)

        Show
        Robert Muir added a comment - Does this mean harmony might need these methods for its own internal use before ICU is available? When doing the toLowerCase(Locale)/toUpperCase(Locale), perhaps String.java could do: if (locale is not Turkish or Azeri or Lithuanian) while (ch < 0x7f) ( just do optimized fast subtraction/addition ) ... // bail out completely and invoke 'UCharacter.xxx' this might be good for performance reasons? And harmony itself, if it uses this method at this point, is likely using Locale.ENGLISH or similar for consistent behavior (filenames, etc) ? Sorry I'm not too familiar with the codebase so I'm not sure if it would work. But it might speed up 'typical' lowercasing in any case, and as far as worst-case 2x for the "special casing": i find the "special" casing is going to be slow anyway: e.g. the Greek sigma example needs to calculate word boundaries! the Turkish/Azeri case is trickier than the existing code, I think it should use UCharacter.XXX too. The reason is it has to be able to handle the case from SpecialCasing where the 'dotted I' is written in decomposed form (e.g. I + COMBINING DOT ABOVE)
        Hide
        Tim Ellison added a comment -

        ... though if it is used as part of the bootstrapping of the class libraries there may need to be some way to do toUpperCase until we are ready to load ICU.

        Show
        Tim Ellison added a comment - ... though if it is used as part of the bootstrapping of the class libraries there may need to be some way to do toUpperCase until we are ready to load ICU.
        Hide
        Tim Ellison added a comment -

        Yeah, that sounds like a reasonable suggestion.

        The existing code in String#toUpperCase(Locale) predates our dependency on ICU, but it may be as simple as replacing the boat-load of code in there with a call to UCharacter.toUppercase(Locale, String)

        Show
        Tim Ellison added a comment - Yeah, that sounds like a reasonable suggestion. The existing code in String#toUpperCase(Locale) predates our dependency on ICU, but it may be as simple as replacing the boat-load of code in there with a call to UCharacter.toUppercase(Locale, String)
        Hide
        Robert Muir added a comment -

        I started looking at this to try to produce a patch, but I found more problems.

        For example, this Locale-sensitive lowerCase does not appear to handle greek final sigma correctly, or various other things
        from SpecialCasing (http://www.unicode.org/Public/4.0-Update/SpecialCasing-4.0.0.txt)

        To implement the casing algorithms here from Unicode ch3.13 is quite complex, is it possible
        for String.toLowerCase(Locale)/toUpperCase(Locale) to somehow use the ICU static methods in UCharacter?

        Show
        Robert Muir added a comment - I started looking at this to try to produce a patch, but I found more problems. For example, this Locale-sensitive lowerCase does not appear to handle greek final sigma correctly, or various other things from SpecialCasing ( http://www.unicode.org/Public/4.0-Update/SpecialCasing-4.0.0.txt ) To implement the casing algorithms here from Unicode ch3.13 is quite complex, is it possible for String.toLowerCase(Locale)/toUpperCase(Locale) to somehow use the ICU static methods in UCharacter?

          People

          • Assignee:
            Tim Ellison
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development