Uploaded image for project: 'Commons Codec'
  1. Commons Codec
  2. CODEC-127

Non-ascii characters in source files

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.5
    • 1.6
    • None

    Description

      Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.

      This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.

      I think we should have a rule of using Unicode escapes for all such non-ascii characters.
      It's particularly important for non-ISO-8859-1 characters.

      Some example classes with non-ascii characters:

      binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
      language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
      language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
      language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
      language\ColognePhoneticTest.java:143             {"ganz", "Gänse"},
      language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
      language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
      language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
      language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
      language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
      language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
      language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
      language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
      

      The characters are probably not correct above, because I used a crude perl script to find them:

      perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" xxxx.java
      

      language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.

      Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:

      if (Character.isLetter('\ufffd'))

      which is an "unknown" character.

      Similarly for binary\Base64Test.java:96.

      It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.

      [Possibly the characters got mangled at some point, or maybe they have always been wrong]

      The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses unicode escaps, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut)

      Attachments

        Activity

          People

            Unassigned Unassigned
            sebb Sebb
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: