Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: 1.6
    • Labels:
      None

      Description

      Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.

      This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.

      I think we should have a rule of using Unicode escapes for all such non-ascii characters.
      It's particularly important for non-ISO-8859-1 characters.

      Some example classes with non-ascii characters:

      binary\Base64Test.java:96         byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
      language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", "664645214"},
      language\ColognePhoneticTest.java:130         String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
      language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
      language\ColognePhoneticTest.java:143             {"ganz", "Gänse"},
      language\DoubleMetaphoneTest.java:1222         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
      language\DoubleMetaphoneTest.java:1227         this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
      language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
      language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
      language\SoundexTest.java:375             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
      language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
      language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
      language\SoundexTest.java:395             Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
      

      The characters are probably not correct above, because I used a crude perl script to find them:

      perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" xxxx.java
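
For readers without Perl, a rough Java equivalent of the one-liner above might look like the following. It is only a sketch: the class name NonAsciiFinder and its methods are hypothetical and not part of Commons Codec. It reads each file as ISO-8859-1 purely so that every byte of 0x80 or above surfaces as a non-ASCII char, whatever the file's real encoding is.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Commons Codec.
final class NonAsciiFinder {

    /** Returns true if the line contains any char above U+007F. */
    static boolean hasNonAscii(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) > 0x7F) {
                return true;
            }
        }
        return false;
    }

    /** Returns "name:lineNumber text" for each line holding a non-ASCII char. */
    static List<String> report(String name, List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (hasNonAscii(lines.get(i))) {
                hits.add(name + ":" + (i + 1) + " " + lines.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path file = Paths.get(arg);
            // ISO-8859-1 maps each byte to exactly one char, so decoding never
            // fails and any byte >= 0x80 becomes a char > 0x7F.
            List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
            report(arg, lines).forEach(System.out::println);
        }
    }
}
```

Like the Perl version, this prints the raw line, so the non-ASCII characters may still display incorrectly on the console; only the file names and line numbers should be relied on.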
      

      language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.

      Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:

      if (Character.isLetter('\ufffd'))

      which is an "unknown" character.

      Similarly for binary\Base64Test.java:96.

      It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.
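
One plausible way to end up with \ufffd is that bytes which are not valid UTF-8 (for example a single ISO-8859-1 byte) were decoded as UTF-8 at some point. The sketch below illustrates that mechanism with the standard Java decoder. The class and method names are made up for the illustration, and whether this is what actually happened to these files is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only; class and method names are invented for this sketch.
final class ReplacementCharDemo {

    /** Decodes bytes as UTF-8; malformed sequences are replaced by U+FFFD. */
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The word from ColognePhoneticTest, encoded as ISO-8859-1:
        // a-umlaut is the single byte 0xE4.
        byte[] latin1 = "G\u00E4nse".getBytes(StandardCharsets.ISO_8859_1);
        String misread = decodeAsUtf8(latin1);

        System.out.println(hex(latin1));                                   // 47 E4 6E 73 65
        System.out.println(misread.charAt(1) == 0xFFFD);                   // true
        System.out.println(hex(misread.getBytes(StandardCharsets.UTF_8))); // 47 EF BF BD 6E 73 65
    }
}
```

Once the text has been saved again, the original byte is gone: only the three bytes EF BF BD remain, so no later conversion can recover the intended character.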

      [Possibly the characters got mangled at some point, or maybe they have always been wrong]

      The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses Unicode escapes, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut).
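
A minimal sketch of what that convention could look like for two of the ColognePhoneticTest entries. It assumes the garbled strings in the listing stand for "moenchengladbach" and "Mueller-Luedenscheidt" written with umlauts; the class name EscapedTestData is hypothetical.

```java
// Hypothetical test-data holder; only a sketch of the proposed convention.
final class EscapedTestData {

    static final String[][] DATA = {
        {"m\u00F6nchengladbach", "664645214"},         // o-umlaut
        {"M\u00FCller-L\u00FCdenscheidt", "65752682"}, // u-umlaut (twice)
    };

    public static void main(String[] args) {
        for (String[] row : DATA) {
            // Print non-ASCII chars as code points so the output is ASCII-only too.
            StringBuilder sb = new StringBuilder();
            for (char c : row[0].toCharArray()) {
                sb.append(c < 0x80 ? String.valueOf(c) : String.format("<U+%04X>", (int) c));
            }
            System.out.println(sb + " -> " + row[1]);
        }
    }
}
```

Because the source file is then pure ASCII, it compiles to the same strings whatever encoding the IDE or build assumes.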

        Activity

        Sebb created issue -
        Gary Gregory added a comment -

        The build deals with this by specifying the encoding in key places.

        In eclipse, I set the encoding to UTF-8 for the source folders.

        Seeing the real chars in the source is nicer but means you may have to deal with your IDE.

        An alternative would be to save IDE settings in SVN. How about that?

        – Posted from Bugbox for iPhone

        Sebb added a comment -

        The problem is that it's not possible to see what the test data is in the IDE (apart from the German chars).

        Also, unless you tell SVN the encoding (e.g. via mime-type), diff e-mails (and possibly conversion to local EOL) may suffer.

        Saving IDE settings in SVN is a non-starter, because there are many different IDEs, and it's anyway not possible to have the settings automatically picked up, as far as I know.

        Have a look again at the non-ISO-8859-1 characters and see if they are correct. I suspect not, as they all appear to be the unspecified character (\ufffd), at least when treated as UTF-8.

        Gary Gregory added a comment -

        I see now, what a mess.

        Sebb added a comment - edited

        Here's the full list of lines containing non-ASCII characters:

        java/org/apache/commons/codec/language/ColognePhonetic.java:264    private static final char[][] PREPROCESS_MAP = new char[][]{{'\u00C4', 'A'}, // ├âÔÇ×
        java/org/apache/commons/codec/language/ColognePhonetic.java:265        {'\u00DC', 'U'}, // Ü
        java/org/apache/commons/codec/language/ColognePhonetic.java:266        {'\u00D6', 'O'}, // ├âÔÇô
        java/org/apache/commons/codec/language/ColognePhonetic.java:267        {'\u00DF', 'S'} // ├â┼©
        java/org/apache/commons/codec/language/ColognePhonetic.java:388     * Converts the string to upper case and replaces germanic umlauts, and the ├óÔé¼┼ô├â┼©├óÔé¼´┐¢.
        test/org/apache/commons/codec/binary/Base64Test.java:96        byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
        test/org/apache/commons/codec/language/ColognePhoneticTest.java:110            {"m├Ânchengladbach", "664645214"},
        test/org/apache/commons/codec/language/ColognePhoneticTest.java:130        String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
        test/org/apache/commons/codec/language/ColognePhoneticTest.java:137            {"Meyer", "M├╝ller"},
        test/org/apache/commons/codec/language/ColognePhoneticTest.java:143            {"ganz", "Gänse"},
        test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1222        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
        test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227        this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
        test/org/apache/commons/codec/language/SoundexTest.java:367        if (Character.isLetter('´┐¢')) {
        test/org/apache/commons/codec/language/SoundexTest.java:369                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
        test/org/apache/commons/codec/language/SoundexTest.java:375            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
        test/org/apache/commons/codec/language/SoundexTest.java:387        if (Character.isLetter('´┐¢')) {
        test/org/apache/commons/codec/language/SoundexTest.java:389                Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
        test/org/apache/commons/codec/language/SoundexTest.java:395            Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
        test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93        String[] names = { "ácz", "átz", "Ignácz", "Ignátz", "Ignác" };
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47                { "Nu├▒ez", "spanish", EXACT },
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49                { "─îapek", "czech", EXACT },
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52                { "Küçük", "turkish", EXACT },
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55                { "Ceauşescu", "romanian", EXACT },
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57                { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58                { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59                { "ÎøÎö΃", "hebrew", EXACT },
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60                { "ácz", "any", EXACT },
        test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61                { "átz", "any", EXACT } });
        

        Note the comment at ColognePhonetic.java:388 - this does not seem to make sense in any encoding, but I could be wrong.
        [You'll need to look at it in the source file itself - the Perl script I used is crude and does not display non-ASCII properly]

        The other dubious entries are:

        Base64Test.java:96
        DoubleMetaphoneTest.java:1222
        DoubleMetaphoneTest.java:1227
        and most of the SoundexTest.java entries.

        Sebb added a comment -

        Just done a comparison of the various versions of ColognePhonetic.java in trunk.

        The corruption of the comments on PREPROCESS_MAP occurred between r1080701 and r1087901 (April 1st, ironically).

        This also corrupted other comments, and the string at line 382.
        The SVN log message says "Annotate with @Override and @Deprecated" - were those added automatically perhaps?

        Sebb added a comment -

        SoundexTest appears to have been corrupted in r1075426 => r1080414.
        Log comment says "Keep these files in UTF-8 encoding for proper Javadoc processing"
        However, I suspect the file was originally in ISO-8859-1, not UTF-8.
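
One way to test that suspicion against an older revision is to check whether its bytes decode strictly as UTF-8. The following is a sketch using the standard CharsetDecoder API; the class name Utf8Check is hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Commons Codec.
final class Utf8Check {

    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(arg));
            System.out.println(arg + ": " + (isValidUtf8(bytes) ? "valid UTF-8" : "not valid UTF-8"));
        }
    }
}
```

This only helps on a copy taken before the corruption (r1075426 here): a file that already contains the replacement character is itself valid UTF-8, and a pure-ASCII file passes under either encoding.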

        Gary Gregory added a comment -

        Sebb: Thank you for your Javadoc fixes in trunk and branches/generics.

        Gary Gregory made changes -
        Field: Summary
        Original Value: Non-ascii characters in test source files
        New Value: Non-ascii characters in source files
        Gary Gregory committed 1157545 (1 file)
        Gary Gregory committed 1157546 (1 file)
        Gary Gregory added a comment -

        Fixed:

        java/org/apache/commons/codec/language/ColognePhonetic.java:264    private static final char[][] PREPROCESS_MAP = new char[][]{{'\u00C4', 'A'}, // ├âÔÇ×
        java/org/apache/commons/codec/language/ColognePhonetic.java:265        {'\u00DC', 'U'}, // Ü
        java/org/apache/commons/codec/language/ColognePhonetic.java:266        {'\u00D6', 'O'}, // ├âÔÇô
        java/org/apache/commons/codec/language/ColognePhonetic.java:267        {'\u00DF', 'S'} // ├â┼©
        
        Gary Gregory committed 1157549 (1 file)
        Gary Gregory committed 1157550 (1 file)
        Gary Gregory added a comment -

        Fixed:

        java/org/apache/commons/codec/language/ColognePhonetic.java:388     * Converts the string to upper case and replaces germanic umlauts, and the ├óÔé¼┼ô├â┼©├óÔé¼´┐¢.
        
        Gary Gregory added a comment -

        Sebb: "The SVN log message says "Annotate with @Override and @Deprecated" - were those added automatically perhaps?"

        Yes, more than likely, using Eclipse.

        sebb committed 1157596 (1 file)
        sebb committed 1157597 (1 file)
        Sebb added a comment - edited

        I now get:

        commons-codec-generics/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:110      {"m├Ânchengladbach", "664645214"},
        commons-codec-generics/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:130      String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
        commons-codec-generics/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
        commons-codec-generics/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:143             {"ganz", "Gänse"},
        commons-codec-generics/src/test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1222     this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
        commons-codec-generics/src/test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227     this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93 String[] names = { "ácz", "átz", "Ignácz", "Ignátz", "Ignác" };
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47           { "Nu├▒ez", "spanish", EXACT },
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49           { "─îapek", "czech", EXACT },
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52           { "Küçük", "turkish", EXACT },
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55           { "Ceauşescu", "romanian", EXACT },
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57           { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58           { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59           { "ÎøÎö΃", "hebrew", EXACT },
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60           { "ácz", "any", EXACT },
        commons-codec-generics/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61           { "átz", "any", EXACT } });
        

        and

        commons-codec/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:110         {"m├Ânchengladbach", "664645214"},
        commons-codec/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:130       String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
        commons-codec/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:137          {"Meyer", "M├╝ller"},
        commons-codec/src/test/org/apache/commons/codec/language/ColognePhoneticTest.java:143          {"ganz", "Gänse"},
        commons-codec/src/test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1227      this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
        commons-codec/src/test/org/apache/commons/codec/language/DoubleMetaphoneTest.java:1232      this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
        commons-codec/src/test/org/apache/commons/codec/language/bm/BeiderMorseEncoderTest.java:93  String[] names = { "ácz", "átz", "Ignácz", "Ignátz", "Ignác" };
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:47           { "Nu├▒ez", "spanish", EXACT },
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:49           { "─îapek", "czech", EXACT },
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:52           { "Küçük", "turkish", EXACT },
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:55           { "Ceauşescu", "romanian", EXACT },
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:57           { "╬æ╬│╬│╬Á╬╗¤î¤Ç╬┐¤à╬╗╬┐¤é", "greek", EXACT },
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:58           { "ðƒÐâÐêð║ð©ð¢", "cyrillic", EXACT },
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:59           { "ÎøÎö΃", "hebrew", EXACT },
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:60           { "ácz", "any", EXACT },
        commons-codec/src/test/org/apache/commons/codec/language/bm/LanguageGuessingTest.java:61           { "átz", "any", EXACT } });
        

        This was using an updated version of the script that uses File::Find to process directory traversal better.
        (Some lines shortened above by manually removing leading spaces)

        I think all the actual errors have now been fixed.

        The remaining lines contain some non-ASCII characters which could be replaced by Unicode escapes for better portability.
        However, that would make it harder to read the code in some cases.
        So I'm thinking of using Unicode escapes in the Strings, but adding the original as an end-of-line comment.
        The comments might still get mangled, but at least the code would not, and it would be easy to reconstruct the comments from the Unicode.
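
As a sketch of how such lines could be generated rather than typed by hand, the helper below turns a string into an escaped Java literal followed by the original text as an end-of-line comment. The class and method names are hypothetical, and it deliberately ignores quotes and backslashes in the input, which a real tool would also have to escape.

```java
// Hypothetical helper, not part of Commons Codec.
final class UnicodeEscaper {

    /** Replaces each non-ASCII UTF-16 code unit with a backslash-u escape. */
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    /** Builds an escaped literal with the original text as a trailing comment. */
    static String literalWithComment(String s) {
        return "\"" + escapeNonAscii(s) + "\", // " + s;
    }

    public static void main(String[] args) {
        // Escape each command-line argument; the comment part keeps the raw text.
        for (String arg : args) {
            System.out.println(literalWithComment(arg));
        }
    }
}
```

If a comment is later mangled, the compiler still sees the escape, so the test data is unaffected and the comment can be regenerated from the literal.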

        WDYT?

        Gary Gregory added a comment -

        That sounds good. Today, the code is not editable/maintainable.

        There does not seem to be anything I can do in Eclipse to fix this just for viewing the chars correctly.

        If the comments are left mangled, then they are not maintainable. If you change the code, then the comment should match. So I would not leave the comments mangled.

        Sebb made changes -
        Description Some of the test cases include characters in a native encoding (possibly UTF-8), rather than using Unicode escapes.

        This can cause a problem for IDEs if they don't know the encoding (e.g. cause compilation errors, which is how I found the issue), and possibly some transformations may corrupt the contents, e.g. fixing EOL.

        I think we should have a rule of using Unicode escapes for all such non-ascii characters.
        It's particularly important for non-ISO-8859-1 characters.

        Some example classes with non-ascii characters:

        {code}
        binary\Base64Test.java:96 byte[] decode = b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
        language\ColognePhoneticTest.java:110 {"m├Ânchengladbach", "664645214"},
        language\ColognePhoneticTest.java:130 String[][] data = {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
        language\ColognePhoneticTest.java:137 {"Meyer", "M├╝ller"},
        language\ColognePhoneticTest.java:143 {"ganz", "Gänse"},
        language\DoubleMetaphoneTest.java:1222 this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
        language\DoubleMetaphoneTest.java:1227 this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
        language\SoundexTest.java:367 if (Character.isLetter('´┐¢')) {
        language\SoundexTest.java:369 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
        language\SoundexTest.java:375 Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
        language\SoundexTest.java:387 if (Character.isLetter('´┐¢')) {
        language\SoundexTest.java:389 Assert.assertEquals("´┐¢000", this.getSoundexEncoder().encode("´┐¢"));
        language\SoundexTest.java:395 Assert.assertEquals("", this.getSoundexEncoder().encode("´┐¢"));
        {code}

        The characters are probably not correct above, because I used a crude perl script to find them:

        {code}
        perl ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
        {code}

        language\SoundexTest.java:367 in particular is incorrect, because it's supposed to be a single character.

        Now one might think that native2ascii -encoding UTF-8 would fix that, but it gives:

        if (Character.isLetter('\ufffd'))

        which is an "unknown" character.

        Similarly for binary\Base64Test.java:96.

        It's not all that clear what the Unicode escapes should be in these cases, but probably not the unknown character.

        [Possibly the characters got mangled at some point, or maybe they have always been wrong]

        The ColognePhoneticTest.java cases are less serious, as the characters are valid ISO-8859-1 (accented German), but given that the rest of the file uses unicode escaps, I think they should be changed too (but add comments to say what they are, e.g. o-umlaut, u-umlaut)
        Sebb added a comment -

        If you change Eclipse to set the container / resource / text file encoding to UTF-8 (since that is what the POM says) the files should display correctly assuming they really are UTF-8.

        sebb committed 1157892 (1 file)
        Reviews: none

        CODEC-127 Convert to use Unicode in strings, but add comments in native encoding (utf-8)

        Gary Gregory added a comment -

        All better with the test source folder set to UTF-8, which I thought I had done, but obviously not.

        I am now a lot less worried about maintenance because the files are editable given the right editor settings. I am inclined to leave things as is.

        Perhaps each file needs a prominent Javadoc comment about using UTF-8 in editors.

        Sebb added a comment -

        See my fix to ColognePhoneticTest in trunk.

        That now shows native comments for all unicode escapes.

        Two of the otherwise lowercase names were previously converted to the Unicode for upper case umlauts; I wonder if that was a mistake?

        Gary Gregory added a comment -

        WRT:

        Author: sebb
        Date: Mon Aug 15 15:47:42 2011
        New Revision: 1157892
        
        URL: http://svn.apache.org/viewvc?rev=1157892&view=rev
        Log:
        CODEC-127 Convert to use Unicode in strings, but add comments in native encoding (utf-8)
        

        I am having second thoughts here. If you cannot edit UTF-8, you cannot edit and maintain the files because if you change the Unicode escape in the code, you must change the comment to match. So now, I am favoring leaving the code as it was before...

        Thoughts?

        Sebb added a comment -

        It's not that one cannot edit UTF-8; the problem is that it is easy to mangle non-ASCII characters by mistake.

        The safest is to only use ASCII, i.e. Unicode escapes, which are valid in both UTF-8 and ISO-8859-1 and all likely default encodings.

        However, they are difficult to read, hence the comments on the lines.
        If the comments get mangled, it will be obvious, because they won't look right; and it's relatively easy to fix them from the Unicode.

        I don't think it's an option to use native characters in the non-comment code, because we already know they can get corrupted, and the corruption won't necessarily cause errors.

        I don't see the harm in "translating" the code into comments; after all, the translation can be done again.
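
As a rough illustration of the convention argued for here, the sketch below keeps the code pure ASCII and names each character in a trailing comment. The class and the helper `toAsciiEscapes` are hypothetical; the two data rows are taken from the ColognePhoneticTest lines quoted in the description. The comments are spelled out in ASCII (o-umlaut, u-umlaut) so the sketch compiles under any encoding; the committed fix used native characters in the comments instead.

```java
public class EscapedTestData {

    // Hypothetical data table: the code is pure ASCII, and a trailing
    // comment names the character each escape stands for.
    public static final String[][] DATA = {
        {"m\u00F6nchengladbach", "664645214"},          // o-umlaut
        {"M\u00FCller-L\u00FCdenscheidt", "65752682"},  // u-umlaut (twice)
    };

    /** Rewrites every non-ASCII char as a six-character escape, leaving ASCII untouched. */
    public static String toAsciiEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String[] row : DATA) {
            System.out.println(toAsciiEscapes(row[0]) + " -> " + row[1]);
        }
    }
}
```

The helper shows why the "translation" is cheap to redo: the escaped form can be regenerated mechanically from the native spelling, so a mangled comment can always be rebuilt from the code.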

        Gary Gregory added a comment -

        Roger that. I'm sold then.

        sebb committed 1157915 (1 file)
        Reviews: none

        CODEC-127 Convert to use Unicode in strings, but add comments in native encoding (utf-8)

        Sebb added a comment -

        Actually, DoubleMetaphoneTest is still corrupt; fixing now.

        sebb committed 1157936 (1 file)
        sebb committed 1157945 (1 file)
        Reviews: none

        Test is useless unless it actually checks the result!
        [No wonder the corrupted chars were not found]
        See CODEC-127

        sebb committed 1157946 (1 file)
        Reviews: none

        Test is useless unless it actually checks the result!
        [No wonder the corrupted chars were not found - now fixed]
        See CODEC-127

        sebb committed 1157962 (1 file)
        Reviews: none

        CODEC-127 Convert to use Unicode except in comments
        Also simplify test

        sebb committed 1157963 (1 file)
        sebb committed 1157964 (1 file)
        sebb committed 1157965 (1 file)
        Sebb made changes -
        Comment [ Sorry, closing " was in the wrong place; it should have been before the file name params ]
        Sebb made changes -
        Description [ Same text as the description above; the only change is the file argument of the perl one-liner, from */*.java to xxxx.java ]
        Sebb made changes -
        Comment [ Sebb:

        I get errors when I try your perl script on Windows with the latest perl (64 bit) from ActiveState. Rather than use this space to figure out why, can you please run it again and check if we are done with this ticket?

        Thank you,
        Gary ]
        Sebb made changes -
        Comment [ If I run the command as is, I get:
        {quote}
        Can't open perl script "ne": No such file or directory
        {quote} ]
        Sebb made changes -
        Comment [ Typo - missing hyphen for flags ]
        Sebb made changes -
        Comment [ Can you post your .pm here or email to ggregory at apache dot org? ]
        Sebb made changes -
        Comment [ Tried it here; works fine.

        Probably an error in your Wild.pm, because I see the same if I omit the -MWild option. ]
        Sebb made changes -
        Comment [ Arg:
        {noformat}
        C:\svn\org\apache\commons\trunks-proper\codec>perl -MWild -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
        Can't open */*.java: Invalid argument.
        {noformat}
        ]
        Sebb made changes -
        Comment [ Sorry, forgot I was using a local module which handles DOS wildcards, see

        http://docs.activestate.com/activeperl/5.14/lib/pods/perlwin32.html#command_line_wildcard_expansion

        Either pass each file in separately, or create Wild.pm and use:

        {code}
        perl -MWild -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
        {code}

        Wild.pm only works for one level of directories. ]
        Sebb made changes -
        Comment [ Perl:

        I did all that and I get:

        {noformat}
        C:\svn\org\apache\commons\trunks-proper\codec>perl -MWild -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV; */*.java"
        syntax error at -e line 1, near "*."
        Execution of -e aborted due to compilation errors.
        {noformat}

        I also have:

        PERL5OPT=-MWild

        in my environment.

        Gary ]
        Sebb made changes -
        Comment [ If I run:

        {noformat}
        perl -n -e "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
        {noformat}

        I get:
        {noformat}
        Can't open */*.java: Invalid argument.
        {noformat}
        ]
        Sebb added a comment -

        I think all the files are now fixed so that the code uses Unicode escapes; the only non-ASCII characters are now in comments.

        Julius Davies added a comment -

        For the Base64 test, I just wanted a test with some characters outside the lower ascii 128. We could do that by casting a char from an int instead if you prefer! (e.g. char c = (char) 129). I don't really care what characters they are.
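
The cast-from-int idea could look roughly like the sketch below. It is an assumption-laden illustration: it uses the JDK's `java.util.Base64` (Java 8 and later) so that it is self-contained, whereas the actual test exercises the Commons Codec class, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class HighCharsBase64 {

    /** Builds a string of count chars starting at the given value, using int-to-char casts. */
    public static String highChars(int start, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append((char) (start + i));
        }
        return sb.toString();
    }

    /** Encodes the text to Base64 via UTF-8 bytes and decodes it again. */
    public static String roundTrip(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        String encoded = Base64.getEncoder().encodeToString(raw);
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String high = highChars(129, 6); // six chars just above the 7-bit ASCII range
        System.out.println(roundTrip(high).equals(high));
    }
}
```

No non-ASCII character appears in the source, so there is nothing for an editor or EOL fixer to corrupt.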

        Gary Gregory added a comment -

        So Base64Test is done, right?

        Sebb added a comment -

        I think Base64Test is OK - I looked back at the original commits, and found an uncorrupted version.

        By the way, it was only Test files that needed fixing, apart from ColognePhonetic, where the fixes were only needed in comments anyway.

        Gary Gregory added a comment -

        I'll leave it up to you to click 'Resolve' for this ticket then

        sebb committed 1158810 (1 file)
        sebb committed 1158812 (1 file)
        Sebb added a comment -

        Fixes applied to trunk and generics branch

        Sebb made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Gary Gregory made changes -
        Fix Version/s 1.6 [ 12317649 ]
        Affects Version/s 1.5 [ 12315210 ]
        Henri Yandell made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Sebb
          • Votes:
            0
            Watchers:
            0