Derby / DERBY-4492

Localized help message from derbyrun.jar has wrong encoding

    Details

    • Bug behavior facts:
      Regression

      Description

      When I change the locale to one of the languages for which we have localized tools messages, either by changing the system locale or by setting the derby.ui.locale property, the output from derbyrun.jar is garbled:

      $ java -Dderby.ui.locale=zh_TW -jar derbyrun.jar
      用法:
      java -jar derbyrun.jar ij [-p propertiesfile] [sql script]
      java -jar derbyrun.jar sysinfo [-cp ...] [-cp help]
      java -jar derbyrun.jar dblook [args] (或是不加引數以查看用法)
      java -jar derbyrun.jar server [args] (或是不加引數以查看用法)
      $ java -Dderby.ui.locale=de_DE -jar derbyrun.jar
      Syntax:
      java -jar derbyrun.jar ij [-p Merkmaldatei] [sql Script]
      java -jar derbyrun.jar sysinfo [-cp ...] [-cp help]
      java -jar derbyrun.jar dblook [Argumente] (oder ohne Argumente f√ºr Syntaxinformationen)
      java -jar derbyrun.jar server [Argumente] (oder ohne Argumente f√ºr Syntaxinformationen)

      Only the help message from derbyrun.jar is garbled. The other tools appear to produce fine output, even when invoked via derbyrun.jar:

      $ java -Dderby.ui.locale=zh_TW -jar derbyrun.jar sysinfo
      ------------------ Java 資訊 ------------------
      Java 版本: 1.6.0_17
      Java 供應商: Sun Microsystems Inc.
      Java 首頁: /usr/jdk/instances/jdk1.6.0/jre
      Java 類別路徑: derbyrun.jar
      OS 名稱: SunOS
      .
      .
      .

      1. handback.zip
        357 kB
        Rick Hillegas
      2. Escape.java
        2 kB
        Knut Anders Hatlen
      3. unicode_escape.diff
        1.15 MB
        Knut Anders Hatlen
      4. Escape.java
        2 kB
        Knut Anders Hatlen
      5. Escape.java
        5 kB
        Knut Anders Hatlen
      6. unicode_escape_v2.diff
        883 kB
        Knut Anders Hatlen
      7. backport-10.4.stat
        0.9 kB
        Knut Anders Hatlen
      8. backport-10.4.diff
        286 kB
        Knut Anders Hatlen

          Activity

          Knut Anders Hatlen added a comment -

          This is a regression in 10.5.2.0. Before, the messages looked OK. Here's from 10.5.1.1:

          % java -Dderby.ui.locale=de_DE -jar /code/derby/oldreleases/10.5.1.1/derbyrun.jar
          Syntax:
          java -jar derbyrun.jar ij [-p Merkmaldatei] [sql Script]
          java -jar derbyrun.jar sysinfo [-cp ...] [-cp help]
          java -jar derbyrun.jar dblook [Argumente] (oder ohne Argumente für Syntaxinformationen)
          java -jar derbyrun.jar server [Argumente] (oder ohne Argumente für Syntaxinformationen)

          It looks like the problem is in the message files. Here's the message above from toolsmessages_de_DE.properties in 10.5.3.0:

          RUN_Usage=Syntax\:\njava -jar derbyrun.jar ij [-p Merkmaldatei] [sql Script]\njava -jar derbyrun.jar sysinfo [-cp ...] [-cp help] \njava -jar derbyrun.jar dblook [Argumente] (oder ohne Argumente f\u221A\u00BAr Syntaxinformationen)\njava -jar derbyrun.jar server [Argumente] (oder ohne Argumente f\u221A\u00BAr Syntaxinformationen)

          Note that the word "für" is represented as "f\u221A\u00BAr", that is two characters for the single character "ü", so that it is displayed as "√º".
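
The effect of the doubled escape can be reproduced outside Derby. The following is a minimal sketch, not Derby code: the class name `EscapeDemo` and the key `MSG` are made up for illustration, and the property value is shortened to the single word in question. It loads both forms through `java.util.Properties` and compares the resulting string lengths.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class EscapeDemo {
    // Parses a single key=value line the way a .properties file is parsed
    // and returns the value stored under the (hypothetical) key MSG.
    static String load(String line) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(line));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("MSG");
    }

    public static void main(String[] args) {
        // Broken form from the 10.5.3.0 file: two escapes for one letter.
        String broken = load("MSG=f\\u221A\\u00BAr");
        // Expected form: a single escape for U+00FC.
        String fixed = load("MSG=f\\u00FCr");
        System.out.println(broken.length()); // prints 4
        System.out.println(fixed.length());  // prints 3
    }
}
```

The extra character in the broken value is what shows up on screen as "√º" in place of "ü".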

          Knut Anders Hatlen added a comment -

          This seems to affect more messages, not only the ones used by derbyrun.jar:

          ij> connect 'jdbc:derby:db;territory=de_DE;create=true';
          ij> create table t(x int);
          0 rows inserted/updated/deleted
          ij> alter table t add constraint c primary key ;
          ERROR 42831: 'X' kann Nullwerte enthalten und daher keine Spalte eines Prim√§rschl√ºssels sein.

          "Prim√§rschl√ºssels" should have been "Primärschlüssels".

          It looks like all messages touched by revision 793870 (DERBY-4221) have the wrong escape sequence for non-ascii characters. At least it looks so for the languages that I have some minimal knowledge about.

          Myrna van Lunteren added a comment -

          That sounds, unfortunately, like the messages for that error were generated using an incorrect encoding.

          I believe they need to be generated with UTF8...

          Which languages did you look at specifically? German, and?

          I believe the translations were contributed by Rick. Perhaps it's possible to arrange for regeneration with the correct encoding?

          Is this problem large enough to warrant a new 10.5 release?

          Knut Anders Hatlen added a comment -

          > Which languages did you look at specifically? German, and?

          I've found problems with the messages in these locales:

          de_DE
          fr_FR
          zh_TW
          zh_CN
          ja_JP

          (No, I don't know Chinese or Japanese, but I know what Chinese and Japanese characters don't look like...)

          I suspect there are problems with all the locales touched by DERBY-4221, but those are the only languages I have checked explicitly.

          For French and German, the problem isn't that severe. The messages look odd, but they can still be understood, since most characters are ASCII anyway. For languages that use mostly non-ASCII characters, though, I would expect the garbled messages to be completely incomprehensible.

          Knut Anders Hatlen added a comment -

          This problem also appears to affect the messages touched for 10.4 (DERBY-3804). The messages for 10.3 and earlier look fine, as far as I can tell.

          Now, the garbling seems to be different in 10.4 and 10.5.

          In 10.4, a word such as "Schlüssel" would be encoded as "Schl\u00C3\u00BCssel", whereas it should have been "Schl\u00FCssel". Here, the problem seems obvious: "ü" has the codepoint 0xFC, and should therefore have the unicode escape sequence \u00FC. However, the UTF-8 encoding of ü is {0xC3, 0xBC}, and it looks like each byte in the UTF-8 encoded sequence is inserted as a separate codepoint. That is, ü --> {0xC3, 0xBC} --> \u00C3\u00BC --> Ã¼. It should be fairly easy to write a script that goes through the original patch and fixes this up.
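
The 10.4 pattern is what you get when UTF-8 bytes are read back as ISO-8859-1 before escaping. A minimal sketch, assuming that is indeed what happened (the class and method names are hypothetical, and this is not the tool that was used to produce the message files):

```java
import java.nio.charset.StandardCharsets;

final class DoubleEncodingDemo {
    // Writes every non-ASCII char as a properties-style unicode escape.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    // Reproduces the suspected mistake: UTF-8 bytes decoded as ISO-8859-1,
    // so each byte becomes a codepoint of its own.
    static String garble(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake: the chars go back to bytes, then decode as UTF-8.
    static String repair(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "Schl\u00FCssel";
        System.out.println(escape(word));         // correct escape
        System.out.println(escape(garble(word))); // escape as seen in 10.4
        System.out.println(repair(garble(word)).equals(word)); // prints true
    }
}
```

Because ISO-8859-1 maps every byte to exactly one codepoint, the garbling is lossless and can be undone mechanically.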

          In 10.5, I have a harder time seeing what's going on. There, the character ü (0xFC) is escaped as \u221A\u00BA, and ö (0xF6) as \u221A\u2202. I fail to see a pattern here.

          Knut Anders Hatlen added a comment -

          I think I can see what happened in 10.5 now. I came across the character table for Mac OS Roman encoding here: http://smontagu.damowmow.com/genEncodingTest.cgi?family=apple&codepage=roman

          Now, "ö" has the codepoint 0xF6, which is encoded in UTF-8 as {0xC3, 0xB6}. If you decode this byte sequence using the above-mentioned character table, you get the character sequence "√∂", which has the unicode escape sequence \u221A\u2202. This is the same escape sequence used to represent ö in the 10.5 messages.

          The same goes for "ü": codepoint 0xFC --> UTF-8 {0xC3, 0xBC} --> decoded using the character table to "√º", whose unicode escape is \u221A\u00BA.

          I think this should be a fairly easy scripting task to fix as well.
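
The Mac OS Roman theory can be checked with a few lines of code. This is a rough sketch under stated assumptions: the lookup table holds only the three byte values mentioned in this thread rather than the full Mac OS Roman table, and the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class MacRomanGarbleDemo {
    // Only the three Mac OS Roman byte values discussed in this thread.
    private static final Map<Integer, Character> MAC_ROMAN_SUBSET = new HashMap<>();
    static {
        MAC_ROMAN_SUBSET.put(0xC3, '\u221A'); // square root sign
        MAC_ROMAN_SUBSET.put(0xBC, '\u00BA'); // masculine ordinal indicator
        MAC_ROMAN_SUBSET.put(0xB6, '\u2202'); // partial differential sign
    }

    // Encodes the text as UTF-8, decodes each non-ASCII byte through the
    // Mac OS Roman subset, and writes the result as unicode escapes.
    static String garbledEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                sb.append((char) v);
                continue;
            }
            Character c = MAC_ROMAN_SUBSET.get(v);
            if (c == null) {
                throw new IllegalArgumentException("byte not in subset table: " + v);
            }
            sb.append(String.format("\\u%04X", (int) c.charValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(garbledEscape("f\u00FCr")); // u-umlaut case
        System.out.println(garbledEscape("\u00F6"));   // o-umlaut case
    }
}
```

Both results match the escape sequences found in the 10.5 message files, which supports the explanation above.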

          Rick Hillegas added a comment -

          I am attaching handback.zip. These are the 10.5 message files in UTF-8, as I received them from the translators. These are the steps of the process I followed:

          1) First I converted our messages to UTF-8 because that is the format which the translators required.

          2) Then the translators made the necessary changes.

          3) Then I converted the changed files back to the encoding which Derby uses.

          At one point I considered changing the encoding on all of our message files to UTF-8. I don't know why we are using our current encoding. I don't know what will break if we do this.

          But if the files in handback.zip look good to you, it might be worthwhile checking them in and then changing the encoding on the remaining message files to UTF-8.

          On the other hand, this might not fix the problem if the errors were introduced in step (1) rather than in step (3).

          Knut Anders Hatlen added a comment -

          Thanks Rick. The files in handback.zip appear to be correct, so I'll see if I can redo step (3), and then go on to fixing up the 10.4 messages. Changing the encoding of the files to UTF-8 sounds appealing, but I think I'll go for the minimal change for now.

          Knut Anders Hatlen added a comment -

          It looks like the files in handback.zip also fix the messages that were touched in 10.4 (haven't checked them all yet, but those I have checked look correct).

          Knut Anders Hatlen added a comment -

          I used the attached java class to regenerate the message files and created a diff against the messages on trunk. I haven't gone through the messages yet to verify that all the changes look sensible, but at least the output from the commands in the bug description looks fine.

          Knut Anders Hatlen added a comment -

          Another issue is that long messages (like the IJ help message) are truncated. These messages appear to be correct in the handback.zip file, and in the attached patch.

          Knut Anders Hatlen added a comment -

          I'm uploading a new version of the program I used to convert the UTF-8 files to escaped ASCII files. This version also escapes tab characters as \t so that the diff becomes smaller. I will regenerate the files and upload a new patch.

          It looks like the messages in handback.zip are ordered slightly differently from the messages in the source, which also makes the patch somewhat bigger than necessary. I'll try to fix this in the new version of the patch.
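
For readers without access to the attachment, the core of such a conversion might look roughly like the sketch below. It is not the attached Escape.java. The class name `AsciiEscaper` is hypothetical, and the sketch covers only the two rules mentioned in this comment (non-ASCII characters and tabs), leaving out ordering and any other clean-up.

```java
final class AsciiEscaper {
    // Converts already-decoded text to the escaped ASCII form used in
    // .properties files: tabs become a backslash-t pair, and everything
    // above printable ASCII becomes a four-digit unicode escape.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\t') {
                sb.append("\\t");
            } else if (c > 0x7E) {
                // Works per UTF-16 unit, so a supplementary character
                // comes out as two escapes (its surrogate pair).
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("f\u00FCr\tSyntax"));
    }
}
```

The input must be decoded as UTF-8 before it reaches `escape`. Decoding it with any other charset at that point is exactly the kind of mistake that produced the garbled files.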

          Knut Anders Hatlen added a comment -

          As to why the message files are not UTF-8 encoded in the first place, I think this is because Properties.load(InputStream) is specified to take a stream encoded in ISO-8859-1. There is a method Properties.load(Reader) that could be used to read files with any encoding, but it is only available on Java 1.6 and later.
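
The difference between the two `load` variants can be seen in a small sketch. The class name, the key `MSG`, and the sample value are made up for illustration. It uses `StandardCharsets`, so it needs a newer Java than the releases discussed in this issue.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesEncodingDemo {
    // Properties.load(InputStream) always decodes the bytes as ISO-8859-1.
    static String viaStream(byte[] data) {
        Properties p = new Properties();
        try {
            p.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("MSG");
    }

    // Properties.load(Reader) lets the caller choose the charset.
    static String viaUtf8Reader(byte[] data) {
        Properties p = new Properties();
        try {
            p.load(new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("MSG");
    }

    public static void main(String[] args) {
        byte[] utf8 = "MSG=f\u00FCr".getBytes(StandardCharsets.UTF_8);
        byte[] escaped = "MSG=f\\u00FCr".getBytes(StandardCharsets.US_ASCII);
        System.out.println(viaStream(utf8).length());     // prints 4 (mis-decoded)
        System.out.println(viaUtf8Reader(utf8).length()); // prints 3
        System.out.println(viaStream(escaped).length());  // prints 3
    }
}
```

A raw UTF-8 file passed to the stream variant is mis-decoded, while the escaped ASCII form survives either route, which is presumably why the message files use it.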

          Knut Anders Hatlen added a comment -

          Attached is a new version of the class that converts the UTF-8 encoded files to the escaped format that Derby understands. It now uses a custom comparator to get the same ordering as the original message files, and it also performs some additional washing of the input to prevent unnecessary diffs.

          A new patch generated from the latest version of the converter is also attached. I intend to commit that patch if all the regression tests pass.

          Knut Anders Hatlen added a comment -

          All the regression tests passed.

          Knut Anders Hatlen added a comment -

          Committed revision 897161.

          I also intend to check in the fix on the 10.5 branch before I close this issue.

          Knut Anders Hatlen added a comment -

          Merged to 10.5 and committed revision 898704.

          Knut Anders Hatlen added a comment -

          Here's a patch that fixes the encoding on 10.4. To create this patch I followed these steps:

          1) Ran a script on the original localization patch that was applied to 10.4 (revision 682388) and replaced the broken escape sequences with the correct ones (e.g., \u00C3\u00A4 -> \u00E4)

          2) Applied the modified patch to a sandbox with the 10.4 branch at revision 682387, which is right before the original 10.4 localization patch was checked in

          3) Updated the sandbox to revision 682388, resolving all conflicts by choosing my changes

          4) Updated the sandbox all the way to head of 10.4

          5) svn diff > backport-10.4.diff

          All the regression tests ran cleanly (well, almost... one failure in derbyall because of DERBY-4418).
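
The script used in step 1 is not attached. As an illustration only, the replacement could be done along the lines of the sketch below. The class name `EscapeFixer` is hypothetical, and the sketch handles two-byte UTF-8 sequences only, which covers the example above but not CJK text, where three-byte sequences would need the same treatment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeFixer {
    // Two consecutive escapes whose values look like a two-byte UTF-8
    // sequence: a lead byte in C2..DF followed by a continuation byte
    // in 80..BF.
    private static final Pattern TWO_BYTE = Pattern.compile(
            "\\\\u00(C[2-9A-F]|D[0-9A-F])\\\\u00([89AB][0-9A-F])",
            Pattern.CASE_INSENSITIVE);

    // Replaces each such pair with the single escape it should have been.
    static String fix(String text) {
        Matcher m = TWO_BYTE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int lead = Integer.parseInt(m.group(1), 16);
            int cont = Integer.parseInt(m.group(2), 16);
            // Standard two-byte UTF-8 decoding: 5 payload bits + 6 payload bits.
            int codePoint = ((lead & 0x1F) << 6) | (cont & 0x3F);
            String escape = String.format("\\u%04X", codePoint);
            m.appendReplacement(out, Matcher.quoteReplacement(escape));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("Schl\\u00C3\\u00BCssel"));
    }
}
```

A blind replacement like this would also rewrite a message that legitimately contains such a pair of characters, so the resulting diff still needs to be reviewed by hand.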

          Knut Anders Hatlen added a comment -

          Here's a sample from 10.4. With 10.4.2.0:

          kah@ugle:~ % java -Dderby.ui.locale=fr_FR -jar /code/derby/oldreleases/10.4.2.0/derbynet.jar
          DRDA_NoCommand.U
          Syntaxe : NetworkServerControl <commandes>
          Commandes :
          start [-h <hôte>] [-p <numéro_port>] [-noSecurityManager] [-ssl <mode_ssl>]
          shutdown [-h <hÃ´te>][-p <numÃ©ro_port>] [-ssl <mode_ssl>] [-user <nom_utilisateur>] [-password <mot_de_passe>]
          ...

          Notice the wrong encoding for hôte and numéro_port on the shutdown line.

          With patched 10.4:

          kah@ugle:~ % java -Dderby.ui.locale=fr_FR -jar /code/derby/10.4/jars/sane/derbynet.jar
          DRDA_NoCommand.U
          Syntaxe : NetworkServerControl <commandes>
          Commandes :
          start [-h <hôte>] [-p <numéro_port>] [-noSecurityManager] [-ssl <mode_ssl>]
          shutdown [-h <hôte>][-p <numéro_port>] [-ssl <mode_ssl>] [-user <nom_utilisateur>] [-password <mot_de_passe>]
          ...

          Knut Anders Hatlen added a comment -

          Committed backport-10.4.diff to the 10.4 branch with revision 899582.

          Myrna van Lunteren added a comment -

          Perhaps we should add a test to prevent this from happening in the future... I'll think on it. Does anyone have suggestions?


            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Knut Anders Hatlen