Derby
  1. Derby
  2. DERBY-728

Unable to create databases whose name containg Chinese characters through the client driver

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.1.2.1
    • Fix Version/s: 10.7.1.1
    • Component/s: Network Client
    • Labels:
    • Environment:
      Debian unstable, LInux 2.6.14.2, libc 2.3.5-6
    • Urgency:
      Normal
    • Issue & fix info:
      High Value Fix, Repro attached
    • Bug behavior facts:
      Embedded/Client difference

      Description

      Trying to create a database with the following URL (note the Chinese character in the database name):

      jdbc:derby://localhost:1527/\u4e10;create=true

      throws the following exception:

      ----%<----
      Exception in thread "main" org.apache.derby.client.am.SqlException: Unicode string can't convert to Ebcdic string
      at org.apache.derby.client.net.EbcdicCcsidManager.convertFromUCS2(Unknown Source)
      at org.apache.derby.client.net.Request.writeScalarPaddedString(Unknown Source)
      at org.apache.derby.client.net.NetConnectionRequest.buildRDBNAM(Unknown Source)
      at org.apache.derby.client.net.NetConnectionRequest.buildACCSEC(Unknown Source)
      at org.apache.derby.client.net.NetConnectionRequest.writeAccessSecurity(Unknown Source)
      at org.apache.derby.client.net.NetConnection.writeServerAttributesAndKeyExchange(Unknown Source)
      at org.apache.derby.client.net.NetConnection.flowServerAttributesAndKeyExchange(Unknown Source)
      at org.apache.derby.client.net.NetConnection.flowUSRIDONLconnect(Unknown Source)
      at org.apache.derby.client.net.NetConnection.flowConnect(Unknown Source)
      at org.apache.derby.client.net.NetConnection.<init>(Unknown Source)
      at org.apache.derby.jdbc.ClientDriver.connect(Unknown Source)
      at java.sql.DriverManager.getConnection(DriverManager.java:525)
      at java.sql.DriverManager.getConnection(DriverManager.java:193)
      at jdbctest.Main.main(Main.java:33)
      ----%<----

      It's possible, however, to create databases using the embedded driver, using an URL like:

      jdbc:derby:\u4e10;create=true

      Tested with both 10.1.1.0 and 10.1.2.1 with the same result.

      Complete code to reproduce the bug:

      ----%<----
      public static void main(String[] args) throws Exception

      { Class.forName("org.apache.derby.jdbc.ClientDriver"); Connection conn = DriverManager.getConnection("jdbc:derby://localhost:1527/\u4e10;create=true"); }

      ----%<----

      1. derby-728-startingpoint.diff
        186 kB
        Dyre Tjeldvoll
      2. derby-728_proto_diff.txt
        34 kB
        Kathey Marsden
      3. ACR7007.pdf
        863 kB
        Kathey Marsden
      4. BigTableName.java
        4 kB
        Rick Hillegas
      5. DERBY-728_p1.diff
        12 kB
        Tiago R. Espinha
      6. DERBY-728_p2.diff
        13 kB
        Tiago R. Espinha
      7. DERBY-728_p2.diff
        14 kB
        Tiago R. Espinha
      8. DERBY-728_p2-test.diff
        6 kB
        Tiago R. Espinha
      9. DERBY-728_p2.diff
        16 kB
        Tiago R. Espinha
      10. DERBY-728_p2-test.diff
        6 kB
        Tiago R. Espinha
      11. DERBY-728_p2.diff
        18 kB
        Tiago R. Espinha
      12. DERBY-728_p2-test.diff
        5 kB
        Tiago R. Espinha
      13. DERBY-728_p2-test.diff
        1 kB
        Tiago R. Espinha
      14. DERBY_728_p2_sanity.diff
        2 kB
        Tiago R. Espinha
      15. DERBY_728_p3.diff
        4 kB
        Tiago R. Espinha

        Issue Links

          Activity

          Hide
          Rick Hillegas added a comment -

          Quite likely, some common bug underlies this problem and Derby-708.

          Show
          Rick Hillegas added a comment - Quite likely, some common bug underlies this problem and Derby-708.
          Hide
          Bernt M. Johnsen added a comment -

          This is not related to Derby-708. This works in the embedded driver, but the path is encoded in EBCDIC (default DRDA encoding) between network client and server, and thus a character set equivalent to ISO-8859-1 is supported. This does not include chinese characters.

          Show
          Bernt M. Johnsen added a comment - This is not related to Derby-708. This works in the embedded driver, but the path is encoded in EBCDIC (default DRDA encoding) between network client and server, and thus a character set equivalent to ISO-8859-1 is supported. This does not include chinese characters.
          Hide
          Myrna van Lunteren added a comment -

          I ran into another occurrence of this while looking at the tests databasePermissions.java and databasePermissions_net.java. The 'embedded' version is made to run some subcases using Greek characters as username and password, networkserver/derbynetclient doesn't.
          When you attempt to modify this so network server uses the user name and password strings with non-ascii (Greek, in this case) characters you bump into the same error (22005.S.3 / CANT_CONVERT_UNICODE_TO_EBCDIC) in org.apache.derby.client.net.EbcdicCcsidManager.convertFromUCS2.

          Show
          Myrna van Lunteren added a comment - I ran into another occurrence of this while looking at the tests databasePermissions.java and databasePermissions_net.java. The 'embedded' version is made to run some subcases using Greek characters as username and password, networkserver/derbynetclient doesn't. When you attempt to modify this so network server uses the user name and password strings with non-ascii (Greek, in this case) characters you bump into the same error (22005.S.3 / CANT_CONVERT_UNICODE_TO_EBCDIC) in org.apache.derby.client.net.EbcdicCcsidManager.convertFromUCS2.
          Hide
          Kathey Marsden added a comment -

          Bernt said in this issue:
          "... the path is encoded in EBCDIC (default DRDA encoding) between network client and server, and thus a character set equivalent to ISO-8859-1 is supported. This does not include chinese characters."

          Are we locked into EBCDIC for this initial negotiation? I believe for data transfer etc, we use UTF-8. Could we change future versions to use UTF-8 for negotiating the databasename, user and password, or would that be in violation of the standard?

          Thanks

          Kathey

          Show
          Kathey Marsden added a comment - Bernt said in this issue: "... the path is encoded in EBCDIC (default DRDA encoding) between network client and server, and thus a character set equivalent to ISO-8859-1 is supported. This does not include chinese characters." Are we locked into EBCDIC for this initial negotiation? I believe for data transfer etc, we use UTF-8. Could we change future versions to use UTF-8 for negotiating the databasename, user and password, or would that be in violation of the standard? Thanks Kathey
          Hide
          Kathey Marsden added a comment -

          I did a quick prototype of Network server/client using a UTF8CcsidManager and was able to connect to the chinese database, but I don't see how we could implement this change and maintain compatibility with earlier versions. All of the exchange of server attributes is done in the CCSID manager encoding so we couldn't negotiate which CcsidManager to use. Are we just stuck until version 11 or does someone have an idea how to manage this?

          Show
          Kathey Marsden added a comment - I did a quick prototype of Network server/client using a UTF8CcsidManager and was able to connect to the chinese database, but I don't see how we could implement this change and maintain compatibility with earlier versions. All of the exchange of server attributes is done in the CCSID manager encoding so we couldn't negotiate which CcsidManager to use. Are we just stuck until version 11 or does someone have an idea how to manage this?
          Hide
          Mamta A. Satoor added a comment -

          I was wondering that until we have a long term solution in version 11, should we provide support for this in earlier versions by having a property which can be used to indiciate UTF8 rather than EBCDIC? Just a suggestion.

          Show
          Mamta A. Satoor added a comment - I was wondering that until we have a long term solution in version 11, should we provide support for this in earlier versions by having a property which can be used to indiciate UTF8 rather than EBCDIC? Just a suggestion.
          Hide
          Kathey Marsden added a comment -

          I know in general the community has avoided such properties, but perhaps it is justified in this case since there seems to be no other alternative. For client we would need to use a system property, since there is no derby.properties file for the client side. It seems a little messy to me, but the upside is that it would be a change that we could backport, so users could benefit immediately.

          Show
          Kathey Marsden added a comment - I know in general the community has avoided such properties, but perhaps it is justified in this case since there seems to be no other alternative. For client we would need to use a system property, since there is no derby.properties file for the client side. It seems a little messy to me, but the upside is that it would be a change that we could backport, so users could benefit immediately.
          Hide
          Kathey Marsden added a comment -

          I noticed in the drda spec:
          4.3.1.13 CCSID Manager
          The CCSIDMGR allows the specification of a single-byte character set CCSID to be associated
          with character typed parameters on DDM command and DDM reply messages. The CCSID
          manager level of the application requester is sent on the EXCSAT command and specifies the
          CCSID that the application requester will use when sending character command parameters.

          So on the one hand it looks like there is the facility to specify a CCSID to use. I think it would be 1208 for UTF-8, but it explicitly says it should be a single-byte character set, so perhaps using UTF-8 is not in compliance with DRDA. But later in an example it mentions a server with CCSID 1208.

          4.3.5.2 Examples
          Below is a simple example of intermediate server processing. The example assumes that each
          server has a different CCSID, indicated in the diagram generically as ebcdic, unicode, or ascii. In
          the figure, the upstream requester has an EBCDIC CCSID (such as CCSID 37), the intermediate
          server has a Unicode CCSID (such as CCSID 1208)...

          So I am not totally sure whether it is legal or not.

          Show
          Kathey Marsden added a comment - I noticed in the drda spec: 4.3.1.13 CCSID Manager The CCSIDMGR allows the specification of a single-byte character set CCSID to be associated with character typed parameters on DDM command and DDM reply messages. The CCSID manager level of the application requester is sent on the EXCSAT command and specifies the CCSID that the application requester will use when sending character command parameters. So on the one hand it looks like there is the facility to specify a CCSID to use. I think it would be 1208 for UTF-8, but it explicitly says it should be a single-byte character set, so perhaps using UTF-8 is not in compliance with DRDA. But later in an example it mentions a server with CCSID 1208. 4.3.5.2 Examples Below is a simple example of intermediate server processing. The example assumes that each server has a different CCSID, indicated in the diagram generically as ebcdic, unicode, or ascii. In the figure, the upstream requester has an EBCDIC CCSID (such as CCSID 37), the intermediate server has a Unicode CCSID (such as CCSID 1208)... So I am not totally sure whether it is legal or not.
          Hide
          Stan Bradbury added a comment - - edited

          Here's a possible workaround for this problem that bypasses DRDA compliance issues. It's not elegant but makes it possible to address this issue without changing the DRDA protocol. The idea is supporting database path aliases as a property. This may have other benefits / uses as well. Your thoughts please.

          Idea:
          I know that some databases [maybe even DB2? DB2 uses DRDA and does not encounter this problem] store location (host / port / path / name [or alias]) information in a file. Could we implement something similar (database names/aliases) using derby properties that can be read by Network server from the derby.properties file to resolve database names on the connection URL? For the server-side you would only, I think, need to specify the absolute or relative path to the database.

          This seems innocuous enough a feature that we could backport to the older codelines to resolve this issue? It would only impact existing implementations that chose to set the property.

          It struck me that something like what is done for the USER property might work well:
          derby.dbalias.myDbAlias=<Yada-Path>/realDbName
          And the connection URL would list only 'myDbAlias'?? I guess this would have to override derby.system.home and be overridded if the conneciton URL looked like a PATH? There are probably other issues I have not thought of.

          Show
          Stan Bradbury added a comment - - edited Here's a possible workaround for this problem that bypasses DRDA compliance issues. It's not elegant but makes it possible to address this issue without changing the DRDA protocol. The idea is supporting database path aliases as a property. This may have other benefits / uses as well. Your thoughts please. Idea: I know that some databases [maybe even DB2? DB2 uses DRDA and does not encounter this problem] store location (host / port / path / name [or alias] ) information in a file. Could we implement something similar (database names/aliases) using derby properties that can be read by Network server from the derby.properties file to resolve database names on the connection URL? For the server-side you would only, I think, need to specify the absolute or relative path to the database. This seems innocuous enough a feature that we could backport to the older codelines to resolve this issue? It would only impact existing implementations that chose to set the property. It struck me that something like what is done for the USER property might work well: derby.dbalias.myDbAlias=<Yada-Path>/realDbName And the connection URL would list only 'myDbAlias'?? I guess this would have to override derby.system.home and be overridded if the conneciton URL looked like a PATH? There are probably other issues I have not thought of.
          Hide
          Kathey Marsden added a comment -

          The Properties javadoc says of load ...

          "the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes ; only a single 'u' character is allowed in an escape sequence."

          So users should be able to specify Chinese characters in the properties file with some effort. So, this might be an option if we can't find a DRDA solution. I would be interested to hear from users whether this a workable option. I can imagine the characters being in an install path and as part of the install the application would need to generate a derby.properties file with the aliases which might be a pain, but I am just speculating on usage.

          Show
          Kathey Marsden added a comment - The Properties javadoc says of load ... "the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes ; only a single 'u' character is allowed in an escape sequence." So users should be able to specify Chinese characters in the properties file with some effort. So, this might be an option if we can't find a DRDA solution. I would be interested to hear from users whether this a workable option. I can imagine the characters being in an install path and as part of the install the application would need to generate a derby.properties file with the aliases which might be a pain, but I am just speculating on usage.
          Hide
          Dyre Tjeldvoll added a comment -

          Just one quick comment here. I once started looking at this issue and I i got some way towards a solution. My impression from looking at the DRDA spec is that it may not be THEORETICALLY possible (because of limitations such as those mentioned by Kathey), but in PRACTICE there should be possible to extend Derby's driver to use a different encoding and fall back to the old behavior when connected to an old server. It should be simple to make the server compatible with the new Derby client, old Derby client, DB2 client or the ODBC client.

          I think I still have the work that I did on this lying around. I can try to make a patch of it, but I can't guarantee that the patch can be made against the latest trunk. That patch would by no means be a polished solution, but could perhaps be a starting point...

          Show
          Dyre Tjeldvoll added a comment - Just one quick comment here. I once started looking at this issue and I i got some way towards a solution. My impression from looking at the DRDA spec is that it may not be THEORETICALLY possible (because of limitations such as those mentioned by Kathey), but in PRACTICE there should be possible to extend Derby's driver to use a different encoding and fall back to the old behavior when connected to an old server. It should be simple to make the server compatible with the new Derby client, old Derby client, DB2 client or the ODBC client. I think I still have the work that I did on this lying around. I can try to make a patch of it, but I can't guarantee that the patch can be made against the latest trunk. That patch would by no means be a polished solution, but could perhaps be a starting point...
          Hide
          Dyre Tjeldvoll added a comment -

          Here is a patch containing the work I had in my sandbox. It is made against
          rev 660997. Updating to the current trunk produced one conflict in DRDAConnThread.java that dodn't look too bad but I'm not sure how to resolve it...

          Show
          Dyre Tjeldvoll added a comment - Here is a patch containing the work I had in my sandbox. It is made against rev 660997. Updating to the current trunk produced one conflict in DRDAConnThread.java that dodn't look too bad but I'm not sure how to resolve it...
          Hide
          Kathey Marsden added a comment -

          Thanks Dyre,

          I would be very interested to see your work on this issue.

          I too am working on a prototype to negotiate the encoding in the MGRLVLS during EXCSAT and am having good results connecting old server with new client and new server with old client. It ends up that the local DRDA experts are working on a preliminary ACR to do this. I should be able to post my prototype and the details of the ACR in January.

          Show
          Kathey Marsden added a comment - Thanks Dyre, I would be very interested to see your work on this issue. I too am working on a prototype to negotiate the encoding in the MGRLVLS during EXCSAT and am having good results connecting old server with new client and new server with old client. It ends up that the local DRDA experts are working on a preliminary ACR to do this. I should be able to post my prototype and the details of the ACR in January.
          Hide
          Dyre Tjeldvoll added a comment -

          Hi Kathey,

          my work on this issue should be in the patch. When working on it, I didn't really aim for a minimal solution, so it may be possible to achieve this with less work than what my patch suggests.

          Based on what I can remember it seems like you are on the right track. Once the negotiating is working you just need to go through the code to find all the places where it is assumed that a given string length will fit in the same number of bytes. That isn't too bad as I recall, but there are some sticky points where a string, encoded in the negotiated charset, should be placed in DRDA fields which are specified with a fixed BYTE size, and should be padded to the correct size with space chars. This is simple as long as you can rely on the space character fitting in one byte, but when supporting encodings such as UCS-2 where space is two bytes, you can no longer use Arrays.fill() directly...

          I also think there is a DRDA field where you are supposed to fill in the client's ip-address (or part of it) as a string in the negotiated charset, and that the number of bytes you can use is not enough to encode in UCS-2. But I think I concluded that this field is not used for anything important, so deviating from the spec here should not be a problem...

          Show
          Dyre Tjeldvoll added a comment - Hi Kathey, my work on this issue should be in the patch. When working on it, I didn't really aim for a minimal solution, so it may be possible to achieve this with less work than what my patch suggests. Based on what I can remember it seems like you are on the right track. Once the negotiating is working you just need to go through the code to find all the places where it is assumed that a given string length will fit in the same number of bytes. That isn't too bad as I recall, but there are some sticky points where a string, encoded in the negotiated charset, should be placed in DRDA fields which are specified with a fixed BYTE size, and should be padded to the correct size with space chars. This is simple as long as you can rely on the space character fitting in one byte, but when supporting encodings such as UCS-2 where space is two bytes, you can no longer use Arrays.fill() directly... I also think there is a DRDA field where you are supposed to fill in the client's ip-address (or part of it) as a string in the negotiated charset, and that the number of bytes you can use is not enough to encode in UCS-2. But I think I concluded that this field is not used for anything important, so deviating from the spec here should not be a problem...
          Hide
          Kathey Marsden added a comment -

          I have been talking with the local DRDA experts and found out that they are working on a new DRDA ACR for a UNICODEMGR which I think can help us with this issue. The short summary is this:

          EXCSAT and ACCSEC are always sent EBCDIC. As part of the MGRLVL exchange, client sends UNICODEMGR 1208 which means that it is requesting all additional DDM parameters will be exchanged in code page 1208. The server responds with UNICODEMGR 1208 if it can accommodate the request. Otherwise it responds with UNICODEMGR 0 and all DDM parameters will continue to be exchanged in EBCDIC.

          One problem with this approach is that ACCSEC currently has an RDBNAM parameter which we treat as required (the spec lists it as optional) and that has to be sent EBCDIC. So, my proposal is that we use the UNICODEMGR and we send RDBNAM on ACCSEC only if the EBCDIC conversion can be done. If the conversion can't be done, we send no RDBNAM on ACCSEC. Then we change the server to use the RDBNAM sent with the unicode SECCHK instead of the one sent on ACCSEC. (Currently we just verify that the SECCHK RDBNAM is the same as the one that was sent on ACCSEC.)

          For an old client working with a new server, there will be no regression and the error message on sending a database name with international characters will be the same as currently listed in
          DERBY-728.

          For a new client working with a old server, this will mean that all the cases that currently pass will still pass, but if a nonconvertible database name is sent (e,g, one with Chinese
          characters) , the server will send back a SYNTAXRM and the
          server console will show:
          = 2110; Error Code Value = e. Plaintext connection attempt
          from an SSL enabled client?
          org.apache.derby.impl.drda.DRDAProtocolException: Execution
          failed because of a Distributed Protocol Error: DRDA_Proto_
          SYNTAXRM; CODPNT arg = 2110; Error Code Value = e. Plaintext
          connection attempt from an SSL enabled client?
          at
          org.apache.derby.impl.drda.DRDAConnThread.throwSyntaxrm(DRDAConn
          Thread.java:513)
          at
          org.apache.derby.impl.drda.DRDAConnThread.missingCodePoint(DRDAC
          onnThread.java:543)
          at
          org.apache.derby.impl.drda.DRDAConnThread.parseACCSEC(DRDAConnTh
          read.java:1948)
          at
          org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDACo
          nnThread.java:943)
          at
          org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.jav
          a:290)

          The client could intercept the SYNTAXRM and knowing it was unable to convert the RDBNAM to EBCDIC could throw the same message it does now. The only regression would be that users attempting to send an invalid database name would now see the server side protocol error occur. I think it is an acceptable regression, since it won't cause any working cases to fail even with mixed revision server/client and it will enable us to move forward and have internationalized database name, user and password.

          I prototyped the change and it all seemed to work ok. I'll attach the prototype patch.

          I would like to implement as much as possible of this for 10.5, but since approval of the ACR by opengroup won't happen by the time we release 10.5, I propose to make the implementation dependent on a client system property derby.drda.unicodemgr=true. This would be false by default but could be switch to true in a maintenance release once opengroup approval occurs. Currently the hope is to have the ACR available publicly by the end of January. Then I would need Rick's help to try to push it through opengroup since he is our opengroup rep. I don't know how long that takes.

          Show
          Kathey Marsden added a comment - I have been talking with the local DRDA experts and found out that they are working on a new DRDA ACR for a UNICODEMGR which I think can help us with this issue. The short summary is this: EXCSAT and ACCSEC are always sent EBCDIC. As part of the MGRLVL exchange, client sends UNICODEMGR 1208 which means that it is requesting all additional DDM parameters will be exchanged in code page 1208. The server responds with UNICODEMGR 1208 if it can accommodate the request. Otherwise it responds with UNICODEMGR 0 and all DDM parameters will continue to be exchanged in EBCDIC. One problem with this approach is that ACCSEC currently has an RDBNAM parameter which we treat as required (the spec lists it as optional) and that has to be sent EBCDIC. So, my proposal is that we use the UNICODEMGR and we send RDBNAM on ACCSEC only if the EBCDIC conversion can be done. If the conversion can't be done, we send no RDBNAM on ACCSEC. Then we change the server to use the RDBNAM sent with the unicode SECCHK instead of the one sent on ACCSEC. (Currently we just verify that the SECCHK RDBNAM is the same as the one that was sent on ACCSEC.) For an old client working with a new server, there will be no regression and the error message on sending a database name with international characters will be the same as currently listed in DERBY-728 . For a new client working with a old server, this will mean that all the cases that currently pass will still pass, but if a nonconvertible database name is sent (e,g, one with Chinese characters) , the server will send back a SYNTAXRM and the server console will show: = 2110; Error Code Value = e. Plaintext connection attempt from an SSL enabled client? org.apache.derby.impl.drda.DRDAProtocolException: Execution failed because of a Distributed Protocol Error: DRDA_Proto_ SYNTAXRM; CODPNT arg = 2110; Error Code Value = e. Plaintext connection attempt from an SSL enabled client? at org.apache.derby.impl.drda.DRDAConnThread.throwSyntaxrm(DRDAConn Thread.java:513) at org.apache.derby.impl.drda.DRDAConnThread.missingCodePoint(DRDAC onnThread.java:543) at org.apache.derby.impl.drda.DRDAConnThread.parseACCSEC(DRDAConnTh read.java:1948) at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDACo nnThread.java:943) at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.jav a:290) The client could intercept the SYNTAXRM and knowing it was unable to convert the RDBNAM to EBCDIC could throw the same message it does now. The only regression would be that users attempting to send an invalid database name would now see the server side protocol error occur. I think it is an acceptable regression, since it won't cause any working cases to fail even with mixed revision server/client and it will enable us to move forward and have internationalized database name, user and password. I prototyped the change and it all seemed to work ok. I'll attach the prototype patch. I would like to implement as much as possible of this for 10.5, but since approval of the ACR by opengroup won't happen by the time we release 10.5, I propose to make the implementation dependent on a client system property derby.drda.unicodemgr=true. This would be false by default but could be switch to true in a maintenance release once opengroup approval occurs. Currently the hope is to have the ACR available publicly by the end of January. Then I would need Rick's help to try to push it through opengroup since he is our opengroup rep. I don't know how long that takes.
          Hide
          Kathey Marsden added a comment -

          Attaching prototype of changes. This patch is not for commit. The prototype implements UNICODEMGR manager level to negotiate DDM parameter encoding. There are some places where performance degredation could occur which need to be addressed and some other issues in comments in the code. The prototype implementation adds a UTF8CcsidManager to client and server and switches the ccsidmgr in the DDMReader and DDMWriter based on the negotiated encoding. Probably that is not the clearest implementation since the new thing is called a UNICODEMGR not a CCSIDMANAGER, but this was the quickest way to implement the prototype. I haven't thought of a better way to do it yet but am open to suggestions.

          The actual implementation will be in subtasks of this issue.

          • Remove required RDBNAM from ACCSEC
          • Client should only send RDBNAM on ACCSEC if EBCDIC conversion is possible.
          • Accomidate length delimited DRDA strings where string length does not equal byte length.
          • Implement UNICODEMGR support. Perhaps this can be broken up into subtasks but hopefully won't be too big ones the other three are done.
          Show
          Kathey Marsden added a comment - Attaching prototype of changes. This patch is not for commit. The prototype implements UNICODEMGR manager level to negotiate DDM parameter encoding. There are some places where performance degredation could occur which need to be addressed and some other issues in comments in the code. The prototype implementation adds a UTF8CcsidManager to client and server and switches the ccsidmgr in the DDMReader and DDMWriter based on the negotiated encoding. Probably that is not the clearest implementation since the new thing is called a UNICODEMGR not a CCSIDMANAGER, but this was the quickest way to implement the prototype. I haven't thought of a better way to do it yet but am open to suggestions. The actual implementation will be in subtasks of this issue. Remove required RDBNAM from ACCSEC Client should only send RDBNAM on ACCSEC if EBCDIC conversion is possible. Accomidate length delimited DRDA strings where string length does not equal byte length. Implement UNICODEMGR support. Perhaps this can be broken up into subtasks but hopefully won't be too big ones the other three are done.
          Hide
          Kathey Marsden added a comment -

          In the reproduction for this issue there is a Chinese character \u4e10. I'd like to know if anyone knows the meaning of this character before I put it in a bunch of tests.

          Thanks

          Kathey

          Show
          Kathey Marsden added a comment - In the reproduction for this issue there is a Chinese character \u4e10. I'd like to know if anyone knows the meaning of this character before I put it in a bunch of tests. Thanks Kathey
          Hide
          Knut Anders Hatlen added a comment -

          According to http://translate.google.com it means "[Formal] a beggar", but I don't speak Chinese, so I really can't say.

          Show
          Knut Anders Hatlen added a comment - According to http://translate.google.com it means " [Formal] a beggar", but I don't speak Chinese, so I really can't say.
          Hide
          Myrna van Lunteren added a comment -

          I also found some web sites that indicate something like that, I think it's safe enough to use in tests.

          Show
          Myrna van Lunteren added a comment - I also found some web sites that indicate something like that, I think it's safe enough to use in tests.
          Hide
          Kathey Marsden added a comment -

          Attached is ACR7007: a proposed change to the DRDA spec for UNICODEMGR, which will allow us to implement a fix for this issue. The ACR was developed within IBM with plans to present it to opengroup for approval. The authors said I could post it here so Derby could benefit and provide comments. Ultimately IBM will submit this to opengroup for incorporation in the spec.

          Please take a look and post any comments to this issue. Rick may be especially interested as our representative at opengroup.

          Show
          Kathey Marsden added a comment - Attached is ACR7007: a proposed change to the DRDA spec for UNICODEMGR, which will allow us to implement a fix for this issue. The ACR was developed within IBM with plans to present it to opengroup for approval. The authors said I could post it here so Derby could benefit and provide comments. Ultimately IBM will submit this to opengroup for incorporation in the spec. Please take a look and post any comments to this issue. Rick may be especially interested as our representative at opengroup.
          Hide
          Tiago R. Espinha added a comment -

          At the risk of pointing out the obvious, I tested this on the 10.5 RC1 using ij and the issue does not arise.

          Interestingly enough, the folder that is created for the database seems to have the Unicode representation of the character and is just called: u4e10

          I'm just wondering whether it isn't somehow assuming the u4e10 as a string literal rather than a Unicode character.

          Show
          Tiago R. Espinha added a comment - At the risk of pointing out the obvious, I tested this on the 10.5 RC1 using ij and the issue does not arise. Interestingly enough, the folder that is created for the database seems to have the Unicode representation of the character and is just called: u4e10 I'm just wondering whether it isn't somehow assuming the u4e10 as a string literal rather than a Unicode character.
          Hide
          Kathey Marsden added a comment -

          Tiago said:
          >I'm just wondering whether it isn't somehow assuming the u4e10 as a string literal >rather than a Unicode character.

          Yes, ij doesn't support escaped unicode characters. You have to use a java program, e.g.
          Connection conn = DriverManager.getConnection("jdbc:derby://localhost:1527/\u4e10;create=true");

          Show
          Kathey Marsden added a comment - Tiago said: >I'm just wondering whether it isn't somehow assuming the u4e10 as a string literal >rather than a Unicode character. Yes, ij doesn't support escaped unicode characters. You have to use a java program, e.g. Connection conn = DriverManager.getConnection("jdbc:derby://localhost:1527/\u4e10;create=true");
          Hide
          Rick Hillegas added a comment -

          Hi Kathey,

          Thanks for attaching the proposed change to Volume 1 of the DRDA spec. It got me thinking about SQL identifiers. It seems from the changes to chapters 6 and 7 on page 5 that DRDA still thinks that sql identifiers are limited to 255 bytes. I don't know where this limitation actually surfaces. It doesn't surface in the simple test I have attached (BigTableName) which pokes and peeks a table whose schema name and table name are each 128 unicode characters long, represented in utf-8 as 384 bytes apiece.

          But it may prevent us from extending our DRDA or SQL support in the future. The maximum length of a SQL identifier is 128 unicode characters, according to part 2 of the 2008 SQL Standard, section 5.2 (<token> and <separator>), syntax rule (13). Derby supports this maximum length. At two bytes per Java character, this works out to 256 bytes, not 255. Since each unicode character can potentially expand to 4 bytes in UTF-8 encoding, the maximum length of a UTF-8 encoded identifier is 512 bytes. I believe that DRDA's sql identifiers should be at least 256 bytes long and probably 512 bytes.

          Thanks.

          Show
          Rick Hillegas added a comment - Hi Kathey, Thanks for attaching the proposed change to Volume 1 of the DRDA spec. It got me thinking about SQL identifiers. It seems from the changes to chapters 6 and 7 on page 5 that DRDA still thinks that sql identifiers are limited to 255 bytes. I don't know where this limitation actually surfaces. It doesn't surface in the simple test I have attached (BigTableName) which pokes and peeks a table whose schema name and table name are each 128 unicode characters long, represented in utf-8 as 384 bytes apiece. But it may prevent us from extending our DRDA or SQL support in the future. The maximum length of a SQL identifier is 128 unicode characters, according to part 2 of the 2008 SQL Standard, section 5.2 (<token> and <separator>), syntax rule (13). Derby supports this maximum length. At two bytes per Java character, this works out to 256 bytes, not 255. Since each unicode character can potentially expand to 4 bytes in UTF-8 encoding, the maximum length of a UTF-8 encoded identifier is 512 bytes. I believe that DRDA's sql identifiers should be at least 256 bytes long and probably 512 bytes. Thanks.
          Hide
          Kathey Marsden added a comment -

          Thank you Rick for looking at this.

          I too am concerned about the DRDA 255 byte character string limit. I had thought of it especially in terms of this feature as users may have much longer paths for database name. I think this general limit extension needs to be addressed as a different ACR. I think we should open a separate issue for it and pursue it as a separate ACR with opengroup.

          Show
          Kathey Marsden added a comment - Thank you Rick for looking at this. I too am concerned about the DRDA 255 byte character string limit. I had thought of it especially in terms of this feature as users may have much longer paths for database name. I think this general limit extension needs to be addressed as a different ACR. I think we should open a separate issue for it and pursue it as a separate ACR with opengroup.
          Hide
          Kathey Marsden added a comment -

          Unassigning this issue. I have not had time to focus on it and don't want to prevent someone else from picking it up.

          I think the spec proposal is complete and the prototype is based on that and seems to work ok. The hardest part seems to be DERBY-4009 to check the byte length limitations because we need to perform the check before we send and currently we do the byte conversion at a pretty low level during the send. The prototype doesn't have the checks.

          Another issue regarding this change may whether to implement it before it officially gets into the spec. I was thinking maybe the initial implementation could be based on a property which would be made the default when the ACR is accepted,. On the other hand there are no user interfaces affected, so maybe it would be ok to go ahead and implement it. We have I think implemented some protocol extensions for setQueryTimeout and session caching that are Derby specific, but I may be wrong on that. This may also be a non-issue at the current rate of progress.

          Anyway, please feel free to pick up on this issue. I will provide any assistance I can.

          Show
          Kathey Marsden added a comment - Unassigning this issue. I have not had time to focus on it and don't want to prevent someone else from picking it up. I think the spec proposal is complete and the prototype is based on that and seems to work ok. The hardest part seems to be DERBY-4009 to check the byte length limitations because we need to perform the check before we send and currently we do the byte conversion at a pretty low level during the send. The prototype doesn't have the checks. Another issue regarding this change may whether to implement it before it officially gets into the spec. I was thinking maybe the initial implementation could be based on a property which would be made the default when the ACR is accepted,. On the other hand there are no user interfaces affected, so maybe it would be ok to go ahead and implement it. We have I think implemented some protocol extensions for setQueryTimeout and session caching that are Derby specific, but I may be wrong on that. This may also be a non-issue at the current rate of progress. Anyway, please feel free to pick up on this issue. I will provide any assistance I can.
          Hide
          Kathey Marsden added a comment -

          I wanted to mention that this ACR 7007 has enterered the opengroup fast track process for review. Rick is the Derby representative at opengroup. Hopefully Rick and the entire opengroup review team will find the ACR in good order and we can implement DERBY-728 to the new specification.

          Show
          Kathey Marsden added a comment - I wanted to mention that this ACR 7007 has enterered the opengroup fast track process for review. Rick is the Derby representative at opengroup. Hopefully Rick and the entire opengroup review team will find the ACR in good order and we can implement DERBY-728 to the new specification.
          Hide
          Kathey Marsden added a comment -

          I marked this issue with labels mentor and gsoc. I would be willing to mentor a returning student in this project, but think it is probably not a good starting project for someone just joining derby.

          Show
          Kathey Marsden added a comment - I marked this issue with labels mentor and gsoc. I would be willing to mentor a returning student in this project, but think it is probably not a good starting project for someone just joining derby.
          Hide
          Kathey Marsden added a comment -

          I was recently discussing UNICODEMGR implementation with some engineers working on other database products. The topic of the encoding for embedded character data in PRDDTA and CRRTKN came up. The fields themselves are architected in DRDA as BYTSTRDR - Byte String, but do contain character data. The question was whether the character data should be UNICODE if UNICODEMGR was being used or should remain EBCDIC. I was asked how it will be in Derby's implemenation. I think it should be UNICODE. It just makes more sense (to me) and is how the current code will work, but should be limited to single byte characters. Let me know if you have a different opinion or better ideas.

          Show
          Kathey Marsden added a comment - I was recently discussing UNICODEMGR implementation with some engineers working on other database products. The topic of the encoding for embedded character data in PRDDTA and CRRTKN came up. The fields themselves are architected in DRDA as BYTSTRDR - Byte String, but do contain character data. The question was whether the character data should be UNICODE if UNICODEMGR was being used or should remain EBCDIC. I was asked how it will be in Derby's implemenation. I think it should be UNICODE. It just makes more sense (to me) and is how the current code will work, but should be limited to single byte characters. Let me know if you have a different opinion or better ideas.
          Hide
          Dag H. Wanvik added a comment -

          I am probably garbling this, since I don't have the DRDA context in my head, but how would this help create database names with Chinese characters? AFAIK they are encoded in Unicode with 3 bytes (UTF-8), not single bytes..

          Show
          Dag H. Wanvik added a comment - I am probably garbling this, since I don't have the DRDA context in my head, but how would this help create database names with Chinese characters? AFAIK they are encoded in Unicode with 3 bytes (UTF-8), not single bytes..
          Hide
          Tiago R. Espinha added a comment -

          @ Kathey:
          I do agree with you, if the UNICODE format is chosen, it should be used to all subsequent exchanges of data. If DRDA specifies this field as byte string, then I suppose the encoding is left to the discretion of the implementation.

          @ Dag:
          In actual fact it wouldn't help the database names but I think for matters of coherence, if the UNICODE format is agreed upon between the parties, then it should be used for all cases.

          Show
          Tiago R. Espinha added a comment - @ Kathey: I do agree with you, if the UNICODE format is chosen, it should be used to all subsequent exchanges of data. If DRDA specifies this field as byte string, then I suppose the encoding is left to the discretion of the implementation. @ Dag: In actual fact it wouldn't help the database names but I think for matters of coherence, if the UNICODE format is agreed upon between the parties, then it should be used for all cases.
          Hide
          Knut Anders Hatlen added a comment -

          Where does the single-byte character restriction come from? If the standard just says these objects contain byte strings, I'd assume we are allowed to encode multi-byte characters in them too.

          Show
          Knut Anders Hatlen added a comment - Where does the single-byte character restriction come from? If the standard just says these objects contain byte strings, I'd assume we are allowed to encode multi-byte characters in them too.
          Hide
          Tiago R. Espinha added a comment -

          I've had a talk with Kathey about how to approach this issue and here's a summary of what was discussed.

          Essentially Kathey explained to me what she implemented in the derby-728_proto_diff.txt prototype. Based on this, we decided that I will first be addressing DERBY-4009 which is a pre-requisite for DERBY-728. After that I will be implementing an UTF8CcsidManager on the server side and respective tests. Finally I will be adding the code for the manager level negotiation (UNICODEMGR introduces a new manager level) and for the switching between the two CcsidManagers.

          Show
          Tiago R. Espinha added a comment - I've had a talk with Kathey about how to approach this issue and here's a summary of what was discussed. Essentially Kathey explained to me what she implemented in the derby-728_proto_diff.txt prototype. Based on this, we decided that I will first be addressing DERBY-4009 which is a pre-requisite for DERBY-728 . After that I will be implementing an UTF8CcsidManager on the server side and respective tests. Finally I will be adding the code for the manager level negotiation (UNICODEMGR introduces a new manager level) and for the switching between the two CcsidManagers.
          Hide
          Tiago R. Espinha added a comment -

          This is my first patch for this issue.

          Kathey, we talked about putting in place that setDatabaseName() method as one functional patch. However, that has already been checked-in.

          As such, what this patch does is set the dbname and shortDbName fields to private. Being protected meant that there were classes that could bypass the setDatabaseName() and set the name directly to the attribute.

          This is usually a bad idea so I encapsulated the fields with getters and setters to enforce the setDatabaseName() method.

          I will be running regressions today.

          Show
          Tiago R. Espinha added a comment - This is my first patch for this issue. Kathey, we talked about putting in place that setDatabaseName() method as one functional patch. However, that has already been checked-in. As such, what this patch does is set the dbname and shortDbName fields to private. Being protected meant that there were classes that could bypass the setDatabaseName() and set the name directly to the attribute. This is usually a bad idea so I encapsulated the fields with getters and setters to enforce the setDatabaseName() method. I will be running regressions today.
          Hide
          Tiago R. Espinha added a comment -

          The regressions ran without issues for DERBY-728_p1.diff.

          Show
          Tiago R. Espinha added a comment - The regressions ran without issues for DERBY-728 _p1.diff.
          Hide
          Tiago R. Espinha added a comment -

          I need to brainstorm a bit here regarding the Utf8CcsidManager class that I have.

          There's one thing that I didn't implement because of a detail regarding the following method:
          public String convertToUCS2(byte[] sourceBytes, int offset, int numToConvert) { }

          So far we had this method on the EbcdicCcsidManager and then it's fine because EBCDIC only uses one byte per character at all times. So the offset parameter always works in a consistent way, that is to say that if offset is 5, we are not only getting 5 sourceBytes but also getting exactly 5 characters.

          However, when we come to an Utf8CcsidManager, this offset might land straight in the middle of a character; then if we cut a character in half byte-wise, we will end with a totally different character.

          Is it acceptable to consider offset as a number of characters rather than a number of bytes? It works both ways in EBCDIC but for UTF8 it would mean converting the sourceBytes to a String, offsetting it character-wise, and then convert the «offsetted» String to UCS2.

          Any other ideas anyone might have?

          Show
          Tiago R. Espinha added a comment - I need to brainstorm a bit here regarding the Utf8CcsidManager class that I have. There's one thing that I didn't implement because of a detail regarding the following method: public String convertToUCS2(byte[] sourceBytes, int offset, int numToConvert) { } So far we had this method on the EbcdicCcsidManager and then it's fine because EBCDIC only uses one byte per character at all times. So the offset parameter always works in a consistent way, that is to say that if offset is 5, we are not only getting 5 sourceBytes but also getting exactly 5 characters. However, when we come to an Utf8CcsidManager, this offset might land straight in the middle of a character; then if we cut a character in half byte-wise, we will end with a totally different character. Is it acceptable to consider offset as a number of characters rather than a number of bytes? It works both ways in EBCDIC but for UTF8 it would mean converting the sourceBytes to a String, offsetting it character-wise, and then convert the «offsetted» String to UCS2. Any other ideas anyone might have?
          Hide
          Kathey Marsden added a comment -

          I think from a practical perspective, at least for the server, the length passed is always going to be on a character border.

          I only looked at the server code, but see the only place where it is ultimately called is from DDMReader.readString();

          protected String readString () throws DRDAProtocolException

          { return readString((int)ddmScalarLen); }

          ddmScalarLen is what was sent from the client as the actual length of the ddm object, so the length should be good.

          I think it would be good enough to document this assumption in the javadoc of the method.

          We could not convert the sourceBytes because the buffer that is being passed is not all data, the rest of it (after offset ++ numToConvert bytes) is just the rest of the empty buffer.

          Show
          Kathey Marsden added a comment - I think from a practical perspective, at least for the server, the length passed is always going to be on a character border. I only looked at the server code, but see the only place where it is ultimately called is from DDMReader.readString(); protected String readString () throws DRDAProtocolException { return readString((int)ddmScalarLen); } ddmScalarLen is what was sent from the client as the actual length of the ddm object, so the length should be good. I think it would be good enough to document this assumption in the javadoc of the method. We could not convert the sourceBytes because the buffer that is being passed is not all data, the rest of it (after offset ++ numToConvert bytes) is just the rest of the empty buffer.
          Hide
          Tiago R. Espinha added a comment -

          Here is the list of changes checked in by patch #2:

          • Utf8CcsidManager.java is rolled in for the server - This CCSID manager is not yet really used in practice. Instances are created but no real live code refers to it yet. I'm still unsure about this implementation and this code might change, but I needed to have a draft in place to proceed further.
          • Both DDMWriter and DDMReader now have three CCSID manager attributes. These two classes have one instance of each of the available managers (UTF-8 and EBCDIC) and a reference to the current enabled one (already part of the existing implementation).

          These two classes also no longer receive a CCSID manager in their constructor. Instead, they default to EBCDIC and whenever required we can setUtf8Ccsid() or revert to EBCDIC by doing setEbcdicCcsid().

          • The DRDAConnThread.java class has also lost its own ccsidManager and will be using the one from its instance of DDMWriter. This class now initializes its DRDAStrings within the initialize() method, as it is only then that we have a DDMWriter available.
          • Finally, the tests ProtocolTest and TestProto have also been changed to accommodate the constructor changes in DDMWriter and DDMReader.

          I will be running regressions tonight. Provided that all goes well and there are no objections to this patch, it is ready for commit.

          Show
          Tiago R. Espinha added a comment - Here is the list of changes checked in by patch #2: Utf8CcsidManager.java is rolled in for the server - This CCSID manager is not yet really used in practice. Instances are created but no real live code refers to it yet. I'm still unsure about this implementation and this code might change, but I needed to have a draft in place to proceed further. Both DDMWriter and DDMReader now have three CCSID manager attributes. These two classes have one instance of each of the available managers (UTF-8 and EBCDIC) and a reference to the current enabled one (already part of the existing implementation). These two classes also no longer receive a CCSID manager in their constructor. Instead, they default to EBCDIC and whenever required we can setUtf8Ccsid() or revert to EBCDIC by doing setEbcdicCcsid(). The DRDAConnThread.java class has also lost its own ccsidManager and will be using the one from its instance of DDMWriter. This class now initializes its DRDAStrings within the initialize() method, as it is only then that we have a DDMWriter available. Finally, the tests ProtocolTest and TestProto have also been changed to accommodate the constructor changes in DDMWriter and DDMReader. – I will be running regressions tonight. Provided that all goes well and there are no objections to this patch, it is ready for commit.
          Hide
          Tiago R. Espinha added a comment -

          I forgot to mention but this patch aims at laying foundation for the dual CCSID manager possibility but the goal is to not break any of the current code. As I mentioned, the UTF-8 CCSID manager isn't yet being used by live code despite being in place and all going well, the EBCDIC will be the default even after the patch, without anything breaking.

          Show
          Tiago R. Espinha added a comment - I forgot to mention but this patch aims at laying foundation for the dual CCSID manager possibility but the goal is to not break any of the current code. As I mentioned, the UTF-8 CCSID manager isn't yet being used by live code despite being in place and all going well, the EBCDIC will be the default even after the patch, without anything breaking.
          Hide
          Tiago R. Espinha added a comment -

          The regressions ran with no failures or errors.

          Show
          Tiago R. Espinha added a comment - The regressions ran with no failures or errors.
          Hide
          Tiago R. Espinha added a comment -

          I'm submitting two patches.

          The first is a new version of DERBY-728_p2.diff which adds the proper ASF header to my Utf8CcsidManager.

          The second checks-in a test for the Utf8CcsidManager class and hooks it to the derbynet suite.

          I'll be running regressions again tonight.

          Show
          Tiago R. Espinha added a comment - I'm submitting two patches. The first is a new version of DERBY-728 _p2.diff which adds the proper ASF header to my Utf8CcsidManager. The second checks-in a test for the Utf8CcsidManager class and hooks it to the derbynet suite. I'll be running regressions again tonight.
          Hide
          Tiago R. Espinha added a comment -

          The regressions ran with no failures using both p2 patches.

          Show
          Tiago R. Espinha added a comment - The regressions ran with no failures using both p2 patches.
          Hide
          Kathey Marsden added a comment -

          Thanks Tiago for the patch, I think the naming of the methods is a bit confusing in the CCSidManger classes the method for example, public String convertToUCS2(byte[] sourceBytes) should probably just be convertToJavaString as as best I can tell that is what it is trying to do. So instead of:

          String sourceString = new String(sourceBytes,"UTF-8");
          return new String(sourceString.getBytes("UTF-16"),"UTF-16");

          I think you could just return sourceString, unless I am missing something entirely.

          I don't see that the CcsidManger numToCharRepresentation is being used anywhere. Could that just be removed?

          Show
          Kathey Marsden added a comment - Thanks Tiago for the patch, I think the naming of the methods is a bit confusing in the CCSidManger classes the method for example, public String convertToUCS2(byte[] sourceBytes) should probably just be convertToJavaString as as best I can tell that is what it is trying to do. So instead of: String sourceString = new String(sourceBytes,"UTF-8"); return new String(sourceString.getBytes("UTF-16"),"UTF-16"); I think you could just return sourceString, unless I am missing something entirely. I don't see that the CcsidManger numToCharRepresentation is being used anywhere. Could that just be removed?
          Hide
          Tiago R. Espinha added a comment -

          Here are two refreshed patches after discussing the above issue with Kathey on IRC. We agreed that the convertToUCS2 method should actually be called convertToJavaString as we have no real requirement to have UCS2 Strings. As long as they are Java Strings, it should be fine as they won't be sent over the network (not in UCS2 anyway).

          With this change, the conversion in the Utf8CcsidManager also changes slightly in this patch.

          I've also removed a leftover comment that shouldn't have stayed there.

          It needs to be said that the naming change should also happen in the client's CcsidManager classes, but I will keep this task for when I do the client changes.

          I'll be running regressions again tonight and post the results in the morning.

          Show
          Tiago R. Espinha added a comment - Here are two refreshed patches after discussing the above issue with Kathey on IRC. We agreed that the convertToUCS2 method should actually be called convertToJavaString as we have no real requirement to have UCS2 Strings. As long as they are Java Strings, it should be fine as they won't be sent over the network (not in UCS2 anyway). With this change, the conversion in the Utf8CcsidManager also changes slightly in this patch. I've also removed a leftover comment that shouldn't have stayed there. It needs to be said that the naming change should also happen in the client's CcsidManager classes, but I will keep this task for when I do the client changes. I'll be running regressions again tonight and post the results in the morning.
          Hide
          Tiago R. Espinha added a comment -

          The regressions failed for the last patch. Do NOT commit. The failure was as follows:

          There was 1 failure:
          1) testBootLock(org.apache.derbyTesting.functionTests.tests.store.BootLockTest)j
          unit.framework.AssertionFailedError: Minion did not start or boot db in 60 secon
          ds.
          ----Minion's stderr:
          Exception in thread "main" java.lang.NumberFormatException: For input string: ""
          at java.lang.NumberFormatException.forInputString(NumberFormatException.
          java:48) at java.lang.Integer.parseInt(Integer.java:470) at java.lang.Int
          eger.valueOf(Integer.java:528) at java.lang.Integer.decode(Integer.java:958)at
          org.apache.derbyTesting.functionTests.tests.store.BootLockMinion.main(BootLockMi
          nion.java:42)
          ----Minion's stderr ended
          at org.apache.derbyTesting.functionTests.tests.store.BootLockTest.waitFo
          rMinionBoot(BootLockTest.java:217)
          at org.apache.derbyTesting.functionTests.tests.store.BootLockTest.testBo
          otLock(BootLockTest.java:130)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
          java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
          sorImpl.java:25)
          at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:
          109)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
          at junit.extensions.TestSetup.run(TestSetup.java:25)
          at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57
          )
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
          at junit.extensions.TestSetup.run(TestSetup.java:25)
          at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
          at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
          at junit.extensions.TestSetup.run(TestSetup.java:25)

          FAILURES!!!
          Tests run: 1, Failures: 1, Errors: 0

          It is consistent by running the test by itself. I will be analysing the issue and submitting another patch soon.

          Show
          Tiago R. Espinha added a comment - The regressions failed for the last patch. Do NOT commit. The failure was as follows: There was 1 failure: 1) testBootLock(org.apache.derbyTesting.functionTests.tests.store.BootLockTest)j unit.framework.AssertionFailedError: Minion did not start or boot db in 60 secon ds. ----Minion's stderr: Exception in thread "main" java.lang.NumberFormatException: For input string: "" at java.lang.NumberFormatException.forInputString(NumberFormatException. java:48) at java.lang.Integer.parseInt(Integer.java:470) at java.lang.Int eger.valueOf(Integer.java:528) at java.lang.Integer.decode(Integer.java:958)at org.apache.derbyTesting.functionTests.tests.store.BootLockMinion.main(BootLockMi nion.java:42) ----Minion's stderr ended at org.apache.derbyTesting.functionTests.tests.store.BootLockTest.waitFo rMinionBoot(BootLockTest.java:217) at org.apache.derbyTesting.functionTests.tests.store.BootLockTest.testBo otLock(BootLockTest.java:130) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl. java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces sorImpl.java:25) at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java: 109) at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24) at junit.extensions.TestSetup$1.protect(TestSetup.java:21) at junit.extensions.TestSetup.run(TestSetup.java:25) at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57 ) at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24) at junit.extensions.TestSetup$1.protect(TestSetup.java:21) at junit.extensions.TestSetup.run(TestSetup.java:25) at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24) at junit.extensions.TestSetup$1.protect(TestSetup.java:21) at junit.extensions.TestSetup.run(TestSetup.java:25) FAILURES!!! Tests run: 1, Failures: 1, Errors: 0 It is consistent by running the test by itself. I will be analysing the issue and submitting another patch soon.
          Hide
          Tiago R. Espinha added a comment -

          It seems these errors come from Kathey's r956075 and are unrelated to my patches. Kathey can you confirm this please?

          Show
          Tiago R. Espinha added a comment - It seems these errors come from Kathey's r956075 and are unrelated to my patches. Kathey can you confirm this please?
          Hide
          Kathey Marsden added a comment -

          Hi Tiago. I think it would be good to change the convertFromUCS2 methods to convertFromJavaString and I wonder about this method which I think should be called
          convertToChar

          public char convertToUCS2Char(byte sourceByte)

          { return (char) sourceByte; }

          For this one should the source byte be in UTF-8 , if so I think a straight cast might be a problem, depending on the default encoding. Is it needed? I don't see it in EbcdicCcsidManager.

          Show
          Kathey Marsden added a comment - Hi Tiago. I think it would be good to change the convertFromUCS2 methods to convertFromJavaString and I wonder about this method which I think should be called convertToChar public char convertToUCS2Char(byte sourceByte) { return (char) sourceByte; } For this one should the source byte be in UTF-8 , if so I think a straight cast might be a problem, depending on the default encoding. Is it needed? I don't see it in EbcdicCcsidManager.
          Hide
          Tiago R. Espinha added a comment -

          I'm attaching two new p2 patches as per Kathey's comments.

          The convertFromUCS2 methods should indeed be convertFromJavaString as that is what it in fact does. These functions will convert any string to UTF-8 as long as it is a valid Java String.

          I've also removed convertToUCS2Char() as this was an artifact that I saw in Kathey's prototype and decided to keep. I'm removing it for now and coming back to it later on if it becomes necessary. By then I expect to have a better understanding of why it is necessary.

          I will be running regressions again tonight.

          Show
          Tiago R. Espinha added a comment - I'm attaching two new p2 patches as per Kathey's comments. The convertFromUCS2 methods should indeed be convertFromJavaString as that is what it in fact does. These functions will convert any string to UTF-8 as long as it is a valid Java String. I've also removed convertToUCS2Char() as this was an artifact that I saw in Kathey's prototype and decided to keep. I'm removing it for now and coming back to it later on if it becomes necessary. By then I expect to have a better understanding of why it is necessary. I will be running regressions again tonight.
          Hide
          Tiago R. Espinha added a comment -

          The regressions ran without failures. I believe the patches are now ready to commit.

          Show
          Tiago R. Espinha added a comment - The regressions ran without failures. I believe the patches are now ready to commit.
          Hide
          Kathey Marsden added a comment -

          Hi Tiago,

          I committed the patch, but then noticed the test names probably ought to be changed too.
          testConvertFromUCS2 used 0 ms .
          testConvertToUCS2 used 0 ms
          Time: 0.593

          Show
          Kathey Marsden added a comment - Hi Tiago, I committed the patch, but then noticed the test names probably ought to be changed too. testConvertFromUCS2 used 0 ms . testConvertToUCS2 used 0 ms Time: 0.593
          Hide
          Tiago R. Espinha added a comment -

          You're right Kathey, I overlooked this. It should be safe to commit straight away, it's just a name change.

          Show
          Tiago R. Espinha added a comment - You're right Kathey, I overlooked this. It should be safe to commit straight away, it's just a name change.
          Hide
          Mike Matrigali added a comment -

          It looks like the new file (Utf8CcsidManager.java) is causing problems with insane builds. When using SanityManager routines you should always surround the call with a:
          if (SanityManager.DEBUG)
          SanityManager.THROWASSERT(...

          This is so that when a release jar is built with sane=false no SanityManager code is included.

          Show
          Mike Matrigali added a comment - It looks like the new file (Utf8CcsidManager.java) is causing problems with insane builds. When using SanityManager routines you should always surround the call with a: if (SanityManager.DEBUG) SanityManager.THROWASSERT(... This is so that when a release jar is built with sane=false no SanityManager code is included.
          Hide
          Tiago R. Espinha added a comment -

          I'm attaching two patches to this issue.

          The first, DERBY_728_p2_sanity.diff aims at fixing an issue with sanity. I was making calls to the SanityManager without a check to verify that the DEBUG was enabled. A question that remains open is: what do we do with the exception on a sane build? Is it ok to muffle it and only deal with it on insane mode?

          The second patch, DERBY_728_p3.diff, should in principle enable UTF-8 support in the server. I'm not sure I've missed something but I'll post a short explanation of what I've done:

          • CodePoint.java
            Inserted the new code point for UNICODEMGR (0x1C08) as per the ACR and added it to the MGR_CODEPOINTS array.
          • NetworkServerControlImpl.java
            Set the minimum manager level for the UNICODEMGR in synch with the MGR_CODEPOINTS array.
          • AppRequester.java
            Set the minimum manager level for the UNICODEMGR in synch with the MGR_CODEPOINTS array.

          Also added a convenience method that tells us whether the AppRequester depicted by this class supports or not UTF8. This relies on the manager level for UNICODEMGR being greater or equal to 1208. If the requester does not support UTF8, we won't get a UNICODEMGR manager level on the EXCSAT and as such UNICODEMGR will be set to 0 (which means this convenience method will return false).

          • DRDAConnThread.java
            When dealing with the ACCSEC code point and after we send the ACCSECRD reply we check whether the appRequester supports UTF8 and if it does, we enable it through the switchToUtf8() method [also part of the patch]. If it doesn't, make sure it goes back to EBCDIC, to make sure we aren't in UTF8 from the previous connection.

          I'm not sure how I'd go about testing this as a client would have to support UTF-8 as well to be able to test it and this is still just the server side implementation. It will be a good test though to see if all regressions pass tonight; tomorrow, provided that this patch seems to be ok, I will do testing with older clients and see what the outcome is.

          Show
          Tiago R. Espinha added a comment - I'm attaching two patches to this issue. The first, DERBY_728_p2_sanity.diff aims at fixing an issue with sanity. I was making calls to the SanityManager without a check to verify that the DEBUG was enabled. A question that remains open is: what do we do with the exception on a sane build? Is it ok to muffle it and only deal with it on insane mode? The second patch, DERBY_728_p3.diff, should in principle enable UTF-8 support in the server. I'm not sure I've missed something but I'll post a short explanation of what I've done: CodePoint.java Inserted the new code point for UNICODEMGR (0x1C08) as per the ACR and added it to the MGR_CODEPOINTS array. – NetworkServerControlImpl.java Set the minimum manager level for the UNICODEMGR in synch with the MGR_CODEPOINTS array. – AppRequester.java Set the minimum manager level for the UNICODEMGR in synch with the MGR_CODEPOINTS array. Also added a convenience method that tells us whether the AppRequester depicted by this class supports or not UTF8. This relies on the manager level for UNICODEMGR being greater or equal to 1208. If the requester does not support UTF8, we won't get a UNICODEMGR manager level on the EXCSAT and as such UNICODEMGR will be set to 0 (which means this convenience method will return false). – DRDAConnThread.java When dealing with the ACCSEC code point and after we send the ACCSECRD reply we check whether the appRequester supports UTF8 and if it does, we enable it through the switchToUtf8() method [also part of the patch] . If it doesn't, make sure it goes back to EBCDIC, to make sure we aren't in UTF8 from the previous connection. – I'm not sure how I'd go about testing this as a client would have to support UTF-8 as well to be able to test it and this is still just the server side implementation. It will be a good test though to see if all regressions pass tonight; tomorrow, provided that this patch seems to be ok, I will do testing with older clients and see what the outcome is.
          Hide
          Tiago R. Espinha added a comment -

          Regressions passed with no failures; I think it's safe to apply at least patch p2_sanity to fix the sanity issues.

          Testing older clients today.

          Show
          Tiago R. Espinha added a comment - Regressions passed with no failures; I think it's safe to apply at least patch p2_sanity to fix the sanity issues. Testing older clients today.
          Hide
          Kathey Marsden added a comment -

          Hi Tiago,

          The only comment I have on the code change is for AppRequester.supportsUtf8Ccsid(), I think == 1208 might be more appropriate than >= and perhaps a static constant would be good.

          For testing, I think you can use the ProtocolTest to test if multi byte characters can be used in database name, userid and password, hopefully escaped unicode characters will work in protocol.tests.

          I committed the sanity manager patch. I think it is fine to just ignore for insane builds, because we don't think it can happen and if it did the NPE that would result would be pretty easy to track.

          It's great to see us so close on the server side!

          Kathey

          Show
          Kathey Marsden added a comment - Hi Tiago, The only comment I have on the code change is for AppRequester.supportsUtf8Ccsid(), I think == 1208 might be more appropriate than >= and perhaps a static constant would be good. For testing, I think you can use the ProtocolTest to test if multi byte characters can be used in database name, userid and password, hopefully escaped unicode characters will work in protocol.tests. I committed the sanity manager patch. I think it is fine to just ignore for insane builds, because we don't think it can happen and if it did the NPE that would result would be pretty easy to track. It's great to see us so close on the server side! Kathey
          Hide
          Tiago R. Espinha added a comment -

          This feature has made it to the 10.7.1 release so unless bugs are found, I'm taking this issue as resolved and I'm closing it.

          Show
          Tiago R. Espinha added a comment - This feature has made it to the 10.7.1 release so unless bugs are found, I'm taking this issue as resolved and I'm closing it.

            People

            • Assignee:
              Tiago R. Espinha
              Reporter:
              Andrei Badea
            • Votes:
              1 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development