Derby / DERBY-1478

Add built in language based ordering and like processing to Derby

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.1.2.1
    • Fix Version/s: 10.4.1.3
    • Component/s: SQL
    • Labels:
      None

      Description

      It would be good for Derby to have built-in language-based ordering based on a locale-specific Collator.

      Language based ordering is an important feature for international deployment. DERBY-533 offers one implementation option for this, but according to the discussion in that issue, National Character Types carry a fair amount of baggage with them, especially in the form of concerns about conversion to and from datetime and number types. Rick mentioned SQL language for collations as an option for language based ordering. There may be other options too, but I thought it worthwhile to add an issue for the high level functional concern, so the best choice can be made for implementation without assuming that National Character Types are the only solution.

      For possible 10.1 workaround and examples see:
      http://wiki.apache.org/db-derby/LanguageBasedOrdering

      Attachments

      1. DERBY-1478_FunctionalSpecV5.html (6 kB, Mamta A. Satoor)
      2. DERBY-1478_FunctionalSpecV4.html (6 kB, Mamta A. Satoor)
      3. DERBY-1478_FunctionalSpecV3.html (5 kB, Mamta A. Satoor)
      4. DERBY-1478_FunctionalSpecV2.html (5 kB, Mamta A. Satoor)
      5. DERBY-1478_FunctionalSpecV1.html (5 kB, Mamta A. Satoor)

        Issue Links

        1. Accept the new collation attribute and store it as a database property. Will be used for language based collation ordering. (Sub-task, Closed, Mamta A. Satoor)
        2. Compare character datatypes with different collation ordering. (Sub-task, Closed, Mamta A. Satoor)
        3. Enable collation based ordering for CHAR data type. (Sub-task, Closed, Mamta A. Satoor)
        4. Enable collation based ordering for VARCHAR, LONG VARCHAR and CLOB data types. (Sub-task, Closed, Mamta A. Satoor)
        5. Document language based ordering which will be implemented by code related sub-tasks of DERBY-1478. (Sub-task, Closed, Laura Stewart)
        6. Implement pushing collation info to store, storing collation info in store metadata, and creating templates based on store metadata. (Sub-task, Closed, Mike Matrigali)
        7. Create initial testing framework for JUnit Collation testing. (Sub-task, Closed, Kathey Marsden)
        8. Calling DatabaseMetaData.getColumns() with % for matching column character in a territory based collated db does not work in 1.4.2 JVMs. (Sub-task, Closed, Unassigned)
        9. Fix LIKE clauses with trailing %. (Sub-task, Closed, Mike Matrigali)
        10. Disable dynamic LIKE optimization for collated databases. (Sub-task, Closed, Mike Matrigali)
        11. Throw error if non-matching collation ids in LIKE. (Sub-task, Closed, Mike Matrigali)
        12. Make sure that all the DTDs for character string types have correct collation assigned to them. (Sub-task, Closed, Mamta A. Satoor)
        13. Ensure that DataTypeDescriptor.comparable() implements the correct rules for determining the collation to use. (Sub-task, Closed, Unassigned)
        14. Address testing todo items in CollationTest.java. (Sub-task, Closed, Kathey Marsden)

          Activity

          Kathey Marsden added a comment -

          DERBY-1478 is a general request for built-in language based ordering and LIKE processing for Derby. DERBY-533 is one possible implementation.

          Mamta A. Satoor added a comment -

          While researching the net on internationalization and localization, I found an excellent tutorial by Sun at http://java.sun.com/docs/books/tutorial/i18n/index.html Anyone who wants to get familiar with the concepts related to internationalization will benefit from this tutorial.

          Mamta A. Satoor added a comment -

          I would like to propose a way for supporting locale sensitive data in Derby. Currently, up to the Derby 10.2 release, the sorting for CHAR and VARCHAR data types is codepoint based (UNICODE). For someone looking for any locale specific collation, they can possibly write a couple of functions as suggested by http://wiki.apache.org/db-derby/LanguageBasedOrdering, but that solution is not complete and not efficient (since functional indexes can't be defined in Derby).

          My proposal for Derby 10.3 is that a user would be able to specify an optional JDBC URL attribute, called territoryBasedCollation, at database create time, and that attribute can be set to true or false. If the attribute is not specified or is set to false, then collation will continue to be codepoint based. But if the user specifies true for territoryBasedCollation, the collation will be based on the language region specified by the existing Derby attribute called territory (territory=ll_CC) http://db.apache.org/derby/docs/10.2/ref/rrefattrib56769.html
          If the territory attribute is not specified at database create time, Derby 10.2 uses java.util.Locale.getDefault to determine the territory for the newly created database.
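
          (As a rough sketch of how the proposed attribute might appear on a create-database URL; the database name and territory below are only illustrative, not existing Derby syntax, and later comments in this issue propose renaming the attribute.)

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.SQLException;

          public class CreateCollatedDb {
              public static void main(String[] args) throws SQLException {
                  // Hypothetical create-database URL following the proposal above:
                  // territory picks the locale, territoryBasedCollation=true turns on
                  // locale based ordering; omitting it (or false) keeps codepoint ordering.
                  Connection conn = DriverManager.getConnection(
                          "jdbc:derby:sampleDB;create=true;territory=es_ES;territoryBasedCollation=true");
                  conn.close();
              }
          }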

          I am not planning to implement any collation support on any existing database, i.e. collation enabling will not be supported at database upgrade time or on a pre-existing database. Those databases will continue to use codepoint based collation. I am proposing to implement the collation support only for new databases.

          The locale based ordering will impact operations that require returning the order of data. That includes
          1) Comparison using comparison operators (<, >, =, IN, BETWEEN)
          2) Statements that involve sorting (ORDER BY, GROUP BY, DISTINCT, MAX, and MIN)
          3) Statements that use the LIKE keyword
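
          (The following is just a JDK-level sketch, not Derby code, showing why these operations can return different results once a locale-specific java.text.Collator is used instead of codepoint comparison.)

          import java.text.Collator;
          import java.util.Locale;

          public class OrderingDemo {
              public static void main(String[] args) {
                  // Codepoint (UCS_BASIC-style) comparison: 'Z' (U+005A) sorts before
                  // 'a' (U+0061), so "Zebra" comes before "apple".
                  System.out.println("Zebra".compareTo("apple") < 0);              // true

                  // Locale-sensitive comparison: an English Collator orders "apple"
                  // before "Zebra", reversing the codepoint result.
                  Collator english = Collator.getInstance(Locale.ENGLISH);
                  System.out.println(english.compare("Zebra", "apple") > 0);       // true
              }
          }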

          Derby already has a lot of code for locale based ordering for the disabled NATIONAL CHAR and NATIONAL VARCHAR datatypes. I hope to leverage that code heavily and see how it can be used for this project. Also, I am keeping a goal for myself to implement this in such a way that databases with codepoint based collation will not get penalized by the code for locale based collation.

          Other than finding a means of storing the territoryBasedCollation attribute from the URL somewhere, I don't anticipate any other disk changes as part of this project.

          Please share any comments. In the meantime, I will start looking at how to accept the new JDBC URL attribute in the create database URL and how to store that attribute.

          Daniel John Debrunner added a comment -

          Collations are generally described by the SQL standard (2003) section 4.2.6. That describes collation names and the ability to support named collations at various levels, such as per-schema and per-table. I think it would be good if your proposed scheme could integrate cleanly with the SQL standard mechanism should anyone decide to implement it in the future.

          So could the boolean JDBC attribute territoryBasedCollation instead be replaced with a value-based attribute (say 'collation') that took the name of a collation? Then the names of the collation could match the SQL standard, such as UCS_BASIC (I think) to represent the existing code point ordering. You could continue to support just two collations (UCS_BASIC and the territory/locale based one), but the scheme could be expanded in the future to allow other schemes such as case insensitive collation.

          Daniel John Debrunner added a comment -

          I assume that this collation only applies to CHAR and VARCHAR types; it would be good to state that in any functional spec/documentation.

          Will the system tables use the localized collation for their character types or continue to always use the Unicode code point ordering regardless of any database defined collation? Currently the uppercasing of SQL statements and identifiers is fixed as English to avoid unexpected issues with other languages; it may be wise to take the same approach so that the system table behaviour is fixed. However, that will introduce some issues, as then comparing a system column against a user column or a constant must decide on a collation to use. On the other hand, changing the collation for system tables may break built-in assumptions or JDBC contracts (e.g. order of JDBC database metadata).

          Mamta A. Satoor added a comment -

          I have updated the functional spec based on some feedback from Dan. The changes to the spec, in brief, are:
          1) The collation ordering will apply only to CHAR and VARCHAR types.
          2) The collation for system tables will continue to be code point based no matter what collation the user chooses for CHAR and VARCHAR types.
          3) I will change the JDBC attribute to accept String values. This way, this feature can be expanded in the future to support a wider variety of collations rather than just code point based ordering and territory based ordering.

          The functional spec now looks as follows
          I would like to propose a way for supporting locale sensitive data in Derby. Currently, up to the Derby 10.2 release, the sorting for CHAR and VARCHAR data types is codepoint based (UNICODE). For someone looking for any locale specific collation, they can possibly write a couple of functions as suggested by http://wiki.apache.org/db-derby/LanguageBasedOrdering, but that solution is not complete and not efficient (since functional indexes can't be defined in Derby).

          My proposal for Derby 10.3 is that a user would be able to specify an optional JDBC URL attribute, called collation, at database create time, and that attribute can be set to one of the following two values:
          1) UCS_BASIC (This means codepoint based collation. This will also be the default collation used by Derby if no collation attribute is specified on the JDBC URL at database create time. This collation is what Derby 10.2 and prior have supported.) or
          2) TERRITORY_BASED_COLLATION (The collation will be based on the language region specified by the existing Derby attribute called territory (territory=ll_CC) http://db.apache.org/derby/docs/10.2/ref/rrefattrib56769.html If the territory attribute is not specified at database create time, Derby 10.2 uses java.util.Locale.getDefault to determine the territory for the newly created database. Derby 10.3 will continue to use the same mechanism to determine the territory.)

          The collation attribute will apply only to CHAR and VARCHAR datatypes defined in a user-defined table. System tables will continue to use codepoint based collation for their CHAR and VARCHAR columns.

          This collation ordering will impact operations that require returning the order of data on CHAR and VARCHAR columns. That includes
          1) Comparison using comparison operators (<, >, =, IN, BETWEEN)
          2) Statements that involve sorting (ORDER BY, GROUP BY, DISTINCT, MAX, and MIN)
          3) Statements that use the LIKE keyword

          Derby already has a lot of code for locale based ordering for the disabled NATIONAL CHAR and NATIONAL VARCHAR datatypes. I hope to leverage that code heavily and see how it can be used for this project. Also, I am keeping a goal for myself to implement this in such a way that databases with codepoint based collation will not get penalized by the code for locale based collation.

          I am not planning to implement any new collation support on any existing database, i.e. the JDBC attribute collation will not be supported at database upgrade time or on a pre-existing database. Those databases will continue to use codepoint based collation. I am proposing to implement the collation support only for new databases.

          Other than finding a means of storing the collation attribute from the JDBC URL somewhere, I don't anticipate any other disk changes as part of this project.

          Rick Hillegas added a comment -

          Hi Mamta,

          Thanks for taking on this task. This will be a very useful addition to Derby. I don't see a functional spec attached to this issue. Can you tell me where I can find the spec?

          I'm confused about the distinction between the behavior of user-defined CHAR/VARCHAR and the behavior of CHAR/VARCHAR columns in system tables. I see that Dan raised this issue on January 16. As I read the code, it appears to me that SQL identifiers are forced to upper case according to the rules of Locale.ENGLISH. I think this happens in SQLUtil.SQLToUpperCase(). I don't see anything in the SQL spec or in JDBC which gives a special place to that locale. The SQL spec is a bit hard to read, but I get the impression that identifiers are supposed to follow the same comparison rules (at least as far as equality is concerned) as other strings. That at least is how I follow the thread of citations starting in part 2 section 5.4 Syntax Rule 3. The JDBC metadata javadoc talks about sort order but, as far as I can see, does not specify what locale to use. I find it hard to believe that Locale.ENGLISH has been given a special place or that the metadata is supposed to sort in a different order from CHAR/VARCHAR. My personal feeling is that there will be a lot of special cases in the code if we have to use a different sort order for CHAR/VARCHAR depending on whether the strings come out of user or system tables. If we are going to make the two kinds of strings behave differently, I would like to understand more specifics on what problems are being solved and why we think this is the intention of our standards. Thanks.

          Mamta A. Satoor added a comment -

          Rick, the latest functional spec is part of the Jira comment from me, dated Feb 2nd '07. It starts after this line in the comment "The functional spec now looks as follows ".

          I think as more comments get added to this Jira entry, it will be hard to track down the functional spec in the Jira comments, so I will go ahead and attach it as a file to the Jira entry for easier lookup.

          As for your specific questions about the special treatment of the ENGLISH locale, let me take some time to look at the SQL spec and the existing code to see why Derby might be using this for SQL identifiers. If somebody is aware of specific Derby/Cloudscape history behind SQLUtil.SQLToUpperCase(), then please share it.

          Mamta A. Satoor added a comment -

          Attaching the functional spec as DERBY-1478_FunctionalSpecV1.html

          Mamta A. Satoor added a comment -

          Rick, I looked at the SQL specification (Part 2) regarding SQL identifiers. For background, some general information on SQL identifiers from the SQL spec is as follows
          <Start of contents from SQL spec>
          1) As per SQL specification Part 2, Section 4.2.4, the character repertoire for SQL identifiers, SQL_IDENTIFIER, consists of <SQL language character> Latin characters and digits, and all the other characters that the SQL-implementation supports for use in <regular identifier>. After this, everything else related to the SQL_IDENTIFIER character repertoire is defined as implementation-defined. To be specific,
          2) Section 4.2.5, Character encoding form, Pg 22 says SQL_IDENTIFIER is an implementation-defined character encoding form. It is applicable to the SQL_IDENTIFIER character repertoire.
          3) Section 4.2.6, Collation, Pg 23, says SQL_IDENTIFIER is an implementation-defined collation. It is applicable to the SQL_IDENTIFIER character repertoire.
          4) And lastly, in Section 4.2.7, Character Sets, SQL_IDENTIFIER is a character set whose repertoire is SQL_IDENTIFIER and whose character encoding form is SQL_IDENTIFIER. The name of its default collation is SQL_IDENTIFIER.
          5) Section 4.2.3.1, Pg 19, talks about case folding. <fold> is a pair of functions for converting all the lower case and title case characters in a given string to upper case, or all the upper case and title case characters to lower case. A lower case character is a character in the Unicode General Category class "Ll" and an upper case character is a character in the Unicode General Category class "Lu".
          <End of contents from SQL spec>

          From the information above, we see that the SQL specification leaves the CEF and collation for SQL identifiers as implementation-defined, but I do not see it specifically saying that case folding is implementation-defined. Even section 4.2.3.1, Pg 19, second paragraph, talks about converting case in a generic manner in the context of UNICODE and not the English locale.

          So, I am not sure why Derby/Cloudscape chose to use the English locale to do case conversion of SQL identifiers. Derby's StringUtil class, where the SQL case conversion code lies, has the following comment
          // The functions below are used for uppercasing SQL in a consistent manner.
          // Cloudscape will uppercase Turkish to the English locale to avoid i
          // uppercasing to an uppercase dotted i. In future versions, all
          // casing will be done in English. The result will be that we will get
          // only the 1:1 mappings in
          // http://www.unicode.org/Public/3.0-Update1/UnicodeData-3.0.1.txt
          // and avoid the 1:n mappings in
          //http://www.unicode.org/Public/3.0-Update1/SpecialCasing-3.txt
          //
          // Any SQL casing should use these functions

          Dan, you mentioned in one of your comments to this Jira entry that "Currently the uppercasing of SQL statements and identifiers is fixed as English to avoid unexpected issues with other languages". Can you please explain what you mean by unexpected issues? Is that the same reason for recommending the same behavior for system tables?

          Daniel John Debrunner added a comment -

          Mamta> Dan, you mentioned in one of your comments to this Jira entry that "Currently the uppercasing of SQL statements and identifiers is fixed as English to avoid unexpected issues with other languages". Can you please explain what you mean by unexpected issues? Is that the same reason for recommending the same behavior for system tables?

          For the reasons you discovered in StringUtil and quoted above. The Turkish locale changes how a lower case i is upper-cased. Allowing the locale of the database engine to influence the casing of identifiers can lead to the application having to have different versions of its SQL depending on the locale of the database or, more likely, if the application developer is not aware of this issue, unexpected failures:

          create table customer( id int)

          SELECT ID FROM CUSTOMER - will fail if upper casing of identifiers is Turkish.

          Yes, one could try to be consistent in the application, but the schema and application may be developed by different groups. The app developer may only learn of the schema through JDBC metadata thus only knowing that the column is called 'ID'.
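
          (A small JDK sketch, not Derby code, that makes the failure mode concrete, assuming identifiers were upper-cased with the engine's locale.)

          import java.util.Locale;

          public class TurkishCasing {
              public static void main(String[] args) {
                  Locale turkish = new Locale("tr", "TR");
                  // In Turkish, lower case 'i' upper-cases to dotted capital I (U+0130),
                  // so an identifier written as "id" no longer matches "ID".
                  System.out.println("id".toUpperCase(turkish));                 // İD
                  System.out.println("id".toUpperCase(Locale.ENGLISH));          // ID
                  System.out.println("id".toUpperCase(turkish).equals("ID"));    // false
              }
          }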

          Rick Hillegas added a comment -

          Thanks, Mamta and Dan, for clarifying what the SQL Standard says here and for spelling out the issues with our implementation. In order to support SQL collations in the future, we will need to be careful to design this so that it will be clear that the string columns in the system catalogs have a different collation than user-defined string columns.

          Daniel John Debrunner added a comment -

          Agreed that careful design is needed, but there is nothing special about system columns.

          SQL supports schema specific collation, thus in the future Derby could support UNICODE collation in schema SALES and case insensitive collation in schema ENGINEERING. The different collation for system schemas is (should be) just an instance of this (without the ability for the user to define schema specific collation).

          Daniel John Debrunner added a comment -

          Having worked on the DatabaseMetaDataTest and Saurabh's fix for DERBY-2259, I now think the collation needs to apply to all character types because CLOB and LONG VARCHAR can be used in a LIKE expression.

          Mamta A. Satoor added a comment -

          Dan, can you share your thoughts on why the recent work on DatabaseMetaDataTest and Saurabh's fix for DERBY-2259 indicates that collation applies to all character datatypes? In the meantime, I will spend some time myself on the DatabaseMetaDataTest and the work done by Saurabh on DERBY-2259.

          Mamta A. Satoor added a comment -

          Another issue I have been thinking about is what would happen when character type columns with different collation are compared. For instance
          Let's say a database is created to use a territory based collation for character types. And say there is a userSchema schema in that database which has a table tableInfo with a column tablename defined as VARCHAR. This tableInfo.tablename will have territory based collation associated with it. And say this column is then compared with a VARCHAR column in the SYS schema; how will the comparison happen, since the two columns being compared have different collations associated with them? I haven't looked in the SQL spec yet. Does anyone have any thoughts on this comparison?
          select * from sys.systables, userSchema.tableInfo where systables.tablename = tableInfo.tablename

          Rick Hillegas added a comment -

          Hi Mamta,

          As I read part 2 of the SQL Standard, it looks like you need a CAST in order to compare 2 strings which have different collations bound to them. Both string operands must have the same collation--that is my reading of Syntax rule 3b in section 9.13. Sections 6.12 and 6.1 explain how to cast the operands so that you can compare them. I think you need to write an expression like this:

          WHERE userStringCol = CAST ( systemStringCol AS VARCHAR COLLATE userStringColumnsCollation )

          Here's an example I googled up: http://docs.openlinksw.com/virtuoso/sqlrefDATATYPES.html. Hope this helps.

          Mamta A. Satoor added a comment -

          I looked through the DatabaseMetaDataTest and found that LONG VARCHAR and CLOB have DatabaseMetaData.typePredChar defined on them, which means that a user can use WHERE...LIKE on such columns. Since one of the operations impacted by this Jira entry is LIKE comparison, LONG VARCHAR and CLOB will also need to be aware of their collation, just like CHAR and VARCHAR. I will update the functional specification to include LONG VARCHAR and CLOB along with CHAR and VARCHAR. I am attaching version 2 of the functional spec based on these changes (DERBY-1478LanguageBasedOrdering\DERBY-1478_FunctionalSpecV2.html).

          Mamta A. Satoor added a comment -

          Fixed the typo in the URL that pointed to writing functions to achieve collation. The URL http://wiki.apache.org/db-derby/LanguageBasedOrdering had a comma at the end and hence would not find the page.

          Mamta A. Satoor added a comment -

          Updated the functional spec with the following changes and attached it as DERBY-1478_FunctionalSpecV4.html

          1) The valid values for the collation attribute are UCS_BASIC and TERRITORY_BASED
          2) Updated the spec to say that the collation attribute will be saved as a database property.
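
          (A sketch of a create-database URL using the attribute values listed above; the database name and territory are illustrative, and the attribute name follows the earlier suggestion in this issue to call it collation.)

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.SQLException;

          public class CreateTerritoryBasedDb {
              public static void main(String[] args) throws SQLException {
                  // Territory based collation for a Norwegian database; omitting the
                  // collation attribute (or passing collation=UCS_BASIC) keeps the
                  // existing codepoint based ordering.
                  Connection conn = DriverManager.getConnection(
                          "jdbc:derby:salesDB;create=true;territory=no_NO;collation=TERRITORY_BASED");
                  conn.close();
              }
          }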

          Bryan Pendleton added a comment -

          Do we envision using this feature to provide case-insensitive searching support? For example, if I had a column in my database named "STATE_NAME", and I wanted to set things up so that "WHERE STATE_NAME='New York'" and "WHERE STATE_NAME='NEW YORK'" both selected the same set of rows, could I use this collation feature to do that?

          Mamta A. Satoor added a comment -

          The code work that will go in for this feature can be extended by someone with the itch to provide case-insensitive search support. Currently, with this feature, the Collator will be determined by Derby code from the territory, and it will be passed to the Collator sensitive CHAR classes so that they can do collation based on it. Somebody would need to do the coding work to get the correct Collator for case-insensitive search and pass it to the Collator sensitive CHAR classes.

          So, in short, at the end of this feature, an end user can't simply configure Derby to do case-insensitive search rather than territory based search. Some Derby coding will be required to enable case-insensitive search.
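
          (For reference, a minimal JDK sketch of the kind of case-insensitive Collator such an extension would need to obtain and hand to the Collator sensitive CHAR classes; this is plain java.text usage, not existing Derby code.)

          import java.text.Collator;
          import java.util.Locale;

          public class CaseInsensitiveCollator {
              public static void main(String[] args) {
                  Collator collator = Collator.getInstance(Locale.US);
                  // SECONDARY strength ignores case differences (while still
                  // distinguishing accents), so these two spellings compare as equal.
                  collator.setStrength(Collator.SECONDARY);
                  System.out.println(collator.compare("NEW YORK", "New York") == 0); // true
              }
          }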

          Mamta A. Satoor added a comment -

          Updated the functional spec to cover DERBY-2669, which defines the behavior for the case where the collation attribute is specified but no territory attribute is specified at database create time.

          wayne townsend-merino added a comment -

          It would be fantastic if we could set the collation on a per-field basis, as in:
          alter table foo modify bar varchar(10) COLLATE latin1_bin ;
          This way, wouldn't we be able to do a case-sensitive search for a field, even if we set a case-INsensitive collation for the database? Let me know if I can help.

          Mamta A. Satoor added a comment -

          I agree that it would be pretty useful to provide collation on a per-column basis. We should probably enter a new Jira entry for it, for someone with the itch to start looking at it.

          Kathey Marsden added a comment -

          Should there be a release note for this issue?

          Kathey Marsden added a comment -

          Can this issue be resolved?

          Mamta A. Satoor added a comment -

          Kathey, the subtask DERBY-2876 is still open. Without DERBY-2876, Derby's implementation is a subset of the SQL specification. The remaining SQL spec work can be implemented as a new standalone Jira entry. I will go ahead and open a new independent Jira entry for DERBY-2876 so I can go ahead and close DERBY-2876 and DERBY-1478.

          Mamta A. Satoor added a comment -

          I have created DERBY-3651 (which is a jira entry for the subtask DERBY-2876). Will go ahead and close DERBY-1478 now.


            People

            • Assignee: Mamta A. Satoor
            • Reporter: Kathey Marsden
            • Votes: 0
            • Watchers: 1
