|
Kathey Marsden made changes - 14/Jul/06 08:56 PM
Kathey Marsden made changes - 04/Aug/06 08:57 PM
Mamta A. Satoor made changes - 13/Dec/06 06:23 PM
While researching the net on internationalization and localization, I found an excellent tutorial by Sun at http://java.sun.com/docs/books/tutorial/i18n/index.html Anyone who wants to get familiarized with the concepts related to internationalization will benefit from this tutorial.
I would like to propose a way for supporting locale sensitive data in Derby. Currently, upto Derby 10.2 release, the sorting for CHAR and VARCHAR data types is codepoint based (UNICODE). For someone looking for any locale specific collation, they can possibly write a couple functions as suggested by http://wiki.apache.org/db-derby/LanguageBasedOrdering, but that solution is not complete and not efficient (since functional indexes can't be defined in Derby).
My proposal for Derby 10.3 is that a user would be able to specify an optional jdbc url attribute, called territoryBasedCollation, at the database create time and that attribute can be set to true or false. If the attribute is not specified or is set to false, then collation will continue to be codepoint based. But if the user specifies true for territoryBasedCollation, the collation will be based on language region specified by the exisiting Derby attribute called territory (territory=ll_CC) http://db.apache.org/derby/docs/10.2/ref/rrefattrib56769.html If the territory attribute is not specified at the database create time, Derby 10.2 uses java,util.Locale.getDefault to determine the territory for the newly created database. I am not planning to implement any collation support on any existing database, ie collation enabling will not be supported at the upgrade database time or on a pre-existing database. Those databases will continue to use codepoint based collation. I am proposing to implement the collation support only for new databases, The locale based ordering will impact operations that require returning the order of data. That includes 1)Comparison using comparison operators (<, >, =, IN, BETWEEN) 2)Statements that involve sorting (ORDER BY, GROUP BY, DISTINCT, MAX, and MIN) 3)Statements that use the LIKE keyword Derby already has lot of code for locale based ordering for disabled NATIONAL CHAR and NATIONAL VARCHAR datatypes. I hope to leverage highly on that code and see how it can be used for this project. Also, I am keeping a goal for myself to implement this in such a way that databases with codepoint based collation will not get penalized by the code for locale based collation. Other than finding a means of storing the territoryBasedCollation attribute from the url somewhere, I don't anticipate any other disk changes as part of this project. Please share if there are any comments. In the mean time, I will start looking at how to accept the new jdb url attribute in the create database url and how to store that attribute . Collations are generally described by the SQL standard (2003) section 4.2.6. That describes collation names and the ability to support named collations at various levels, such as per-schema and per-table. I think it would be good if your proposed scheme could integrate cleanly with the SQL standard mechanism should anyone decide to implement it in the future.
So could the boolean JDBC atttribute territoryBasedCollation be instead replaced with a value based attribute (say 'collation') that took a name of a collation? Then the names of the collation could match the SQL standard, such as UCS_BASIC (I think) to represent the existing code point ordering. You could continue to support just two collations (UCS_BASIC) and the territory/locale based one, but the scheme could be expanded in the future to allow other schemes such as case insensitive collation. I assume that this collation only applies to CHAR and VARCHAR types, would be good to state that in any functional spec/documentation.
Will the system tables use the localized collation for their character types or continue to always use the unicode code point ordering regardless of any database defined collation? Currently the uppercasing of SQL statements and identifiers is fixed as English to avoid unexpected issues with other languages, it may be wise to take the same approach so that the system table behaviour is fixed. However that will introduce some issues as then comparing a system column against a user column or a constant must decide on a collation to use. On the other hand changing the collation for system tables may break builtin assumptions or JDBC contracts (e.g. order of JDBC database meta data). I have updated the functional spec based on some feedback from Dan. The changes to the spec in brief are,
a)The collation ordering will apply only to CHAR and VARCHAR types 2)The collation for system table will continue to be code point based no matter what collation user chooses for CHAR and VARCHAR types. 3)I will change the JDBC attribut to accept String values. This way, this feature can be expanded in future to support more variety of collations rather than just the code point based ordering and territory based ordering. The functional spec now looks as follows I would like to propose a way for supporting locale sensitive data in Derby. Currently, upto Derby 10.2 release, the sorting for CHAR and VARCHAR data types is codepoint based (UNICODE). For someone looking for any locale specific collation, they can possibly write a couple functions as suggested by http://wiki.apache.org/db-derby/LanguageBasedOrdering, but that solution is not complete and not efficient (since functional indexes can't be defined in Derby). My proposal for Derby 10.3 is that a user would be able to specify an optional JDBC url attribute, called collation, at the database create time and that attribute can be set to one of the following 2 values 1) UCS_BASIC (This means codepoint based collation. This will also be the default collation used by Derby if no collation attribute is specified on the JDBC url at the database create time. This collation is what Derby 10.2 and prior have supported) or 2)TERRITORY_BASED_COLLATION (the collation will be based on language region specified by the exisiting Derby attribute called territory (territory=ll_CC) http://db.apache.org/derby/docs/10.2/ref/rrefattrib56769.html If the territory attribute is not specified at the database create time, Derby 10.2 uses java.util.Locale.getDefault to determine the territory for the newly created database. Derby 10.3 will continue to use the same mecahnism to determine the territory.) The collation attribute will apply only for CHAR and VARCHAR datatypes defined in a user-defined table. System tables will continue to use codepoint based collation for it's CHAR and VARCHAR columns. This collation ordering will impact operations that require returning the order of data on CHAR and VARCHAR columns. That includes 1)Comparison using comparison operators (<, >, =, IN, BETWEEN) 2)Statements that involve sorting (ORDER BY, GROUP BY, DISTINCT, MAX, and MIN) 3)Statements that use the LIKE keyword Derby already has lot of code for locale based ordering for disabled NATIONAL CHAR and NATIONAL VARCHAR datatypes. I hope to leverage highly on that code and see how it can be used for this project. Also, I am keeping a goal for myself to implement this in such a way that databases with codepoint based collation will not get penalized by the code for locale based collation. I am not planning to implement any new collation support on any existing database, ie JDBC attribute collation will not be supported at the upgrade database time or on a pre-existing database. Those databases will continue to use codepoint based collation. I am proposing to implement the collation support only for new databases. Other than finding a means of storing the collation attribute from the JDBC url somewhere, I don't anticipate any other disk changes as part of this project. Hi Mamta,
Thanks for taking on this task. This will be a very useful addition to Derby. I don't see a functional spec attached to this issue. Can you tell me where I can find the spec? I'm confused about the distinction between the behavior of user-defined CHAR/VARCHAR and the behavior of CHAR/VARCHAR columns in system tables. I see that Dan raised this issue on January 16. As I read the code, it appears to me that sql identifiers are forced to upper case according to the rules of Locale.ENGLISH. I think this happens in SQLUtil.SQLToUpperCase(). I don't see anything in the SQL spec or in JDBC which gives a special place to that locale. The SQL spec is a bit hard to read, but I get the impression that identifiers are supposed to follow the same comparison rules (at least as far as equality is concerned) as other strings. That is least is how I follow the thread of citations starting in part 2 section 5.4 Syntax Rule 3. The JDBC metadata javadoc talks about sort order but, as far as I can see, does not specify what locale to use. I find it hard to believe that Locale.ENGLISH has been given a special place or that the metadata is supposed to sort in a different order from CHAR/VARCHAR. My personal feeling is that there will be a lot of special cases in the code if we have to use a different sort order for CHAR/VARCHAR depending on whether the strings come out of user or system tables. If we are going to make the two kinds of strings behave differently, I would like to understand more specifics on what problems are being solved and why we think this is the intention of our standards. Thanks. Rick, the latest functional spec is part of the Jira comment from me, dated Feb 2nd '07. It starts after this line in the comment "The functional spec now looks as follows ".
I think as more comments get added to this Jira entry, it will be hard to track down the functional spec in the Jira comments, so I will go ahead and attach it as a file to Jira entry for easier lookup. As for your specific questions about special treatement of ENGLISH locale, let me take some time in looking up the SQL spec and existing code to see why Derby might be using this for SQL identifiers. If somebody is aware of specific Derby/Cloudscape history behind SQLUtil.SQLToUpperCase(), then please share those. Attaching the functional spec as
Mamta A. Satoor made changes - 05/Feb/07 07:48 PM
Rick, I looked at SQL specification(Part 2) regarding SQL identifiers. For background, some general information on SQL identifiers from SQL spec if as follows
<Start of contents from SQL spec> 1)As per SQL specification Part 2, Section 4.2.4, the character repertoire for sql identifiers, SQL_IDENTIFIER, consists of <SQL language character> Latin characters and digits,and all the other characters that the SQL-implementation supports for use in <regular identifier>. After this, everything else related to SQL_IDENTIFER character repertoire is defined as implementation-defined. To be specific, 2)Section 4.2.5, Character encoding form, Pg 22 says SQL_IDENTIFIER is an implementation-defined character encoding form. It is applicable to the SQL_IDENTIFIER character repertoire. 3)Section 4.2.6, Collation, Pg 23, says SQL_IDENTIFIER is an implementation-defined collation. It is applicable to the SQL_IDENTIFIER character repertoire. 4)And lastly, in Section 4.2.7, Character Sets, SQL_IDENTIFIER is a character set whose repertoire is SQL_IDENTIFIER and whose character encoding form is SQL_IDENTIFIER. The name of its default collation is SQL_IDENTIFIER. 5)Section 4.2.3.1, Pg 19, talks about case folding. <fold> is a pair of funtions for converting all the lower case and title case characters in a given string to upper case or all the upper case and title case characters to lower case. A lower case character is a character in the Unicode General Category class "Ll" and upper case character is a character in the Unicode General Category class "Lu". <End of contents from SQL spec> From the information above, we see that SQL specification leaves CEF and collation for SQL identifiers as implementation-defined but I donot see it saying specifically that case folding as implementation-defined. Even the section 4.2.3.1, Pg 19, second paragraph, talks about converting case in a generic manner in the context of UNICODE and not English locale. So, I am not sure why Derby/Cloudscape chose to use English locale to do case conversion of SQL identifiers. Derby's StringUtil class, where the SQL case conversion code lies, has following comment // The functions below are used for uppercasing SQL in a consistent manner. // Cloudscape will uppercase Turkish to the English locale to avoid i // uppercasing to an uppercase dotted i. In future versions, all // casing will be done in English. The result will be that we will get // only the 1:1 mappings in // http://www.unicode.org/Public/3.0-Update1/UnicodeData-3.0.1.txt // and avoid the 1:n mappings in //http://www.unicode.org/Public/3.0-Update1/SpecialCasing-3.txt // // Any SQL casing should use these functions Dan, you mentioned in one of your comments to this Jira entry that "Currently the uppercasing of SQL statements and identifiers is fixed as English to avoid unexpected issue with other languages". Can you please explaing what you mean by unexpected issues? Is that the same reason for recommending same behavior for system tables? Mamta> Dan, you mentioned in one of your comments to this Jira entry that "Currently the uppercasing of SQL statements and identifiers is fixed as English to avoid unexpected issue with other languages". Can you please explaing what you mean by unexpected issues? Is that the same reason for recommending same behavior for system tables?
For the reasons you discovered in StringUtil and quoted above. The Turkish locale changes how a lower case i is upper-cased. Allowing the locale of the database engine to influence the casing of the identifiers can lead to the application having to have different versions of its SQL depending on the locale of the database or more likely, if the application developer is not aware of this issue, unexpected failures create table customer( id int) SELECT ID FROM CUSTOMER - will fail if upper casing of identifiers is Turkish. Yes, one could try to be consistent in the application, but the schema and application may be developed by different groups. The app developer may only learn of the schema through JDBC metadata thus only knowing that the column is called 'ID'. Thanks, Mamta and Dan, for clarifying what the SQL Standard says here and for spelling out the issues with our implementation. In order to support SQL collations in the future, we will need to be careful to design this so that it will be clear that the string columns in the system catalogs have a different collation than user-defined string columns.
Agreed that careful design is needed, but there is nothing special about system columns.
SQL supports schema specific collation, thus in the future Derby could support UNICODE collation in schema SALES and case insensitive collation in schema ENGINEERING. The different collation for system schemas is (should be) just an instance of this (without the ability for the user to define schema specific collation). Having worked on the DatabaseMetaDataTest and Saurabh fixing
Dan, can you share your thoughts on why recent work on DatabaseMetaDataTest and Saurabh fixing
Another issue I have been thinking about is what would happen when character type columns with different collation are compared. For instance
Let's say, a database is created to use a territory based collation for character types. And say there is a userSchema schema in that database which has a table tableInfo with column tablename defined as VARCHAR. This tableInfo.tablename will have territory based collation assoicated with it. And say this column is then compared with a VARCHAR column in SYS schema, then how will the comparison happen, since the 2 columns being compared have different collation associated with them? I haven't looked in SQL spec yet. Does anyone has any thoughts on this comparison? select * from sys.systables and userSchema.tableInfo where systables.tablename = tableInfo,tablename Hi Mamta,
As I read part 2 of the SQL Standard, it looks like you need a CAST in order to compare 2 strings which have different collations bound to them. Both string operands must have the same collation--that is my reading of Syntax rule 3b in section 9.13. Sections 6.12 and 6.1 explain how to cast the operands so that you can compare them. I think you need to write an expression like this: WHERE userStringCol = CAST ( systemStringCol AS VARCHAR COLLATE userStringColumnsCollation ) Here's an example I googled up: http://docs.openlinksw.com/virtuoso/sqlrefDATATYPES.html. Hope this helps. I looked through the DatabaseMetaDataTest and found that LONG VARCHAR and CLOB have DatabaseMetaData.typePredChar defined on them which means that a user can use WHERE...LIKE on such columns. Since one of the operations that is impacted by this Jira entry is LIKE comparison, LONG VARCHAR and CLOB will also need to be aware of their collation just like CHAR and VARCHAR. I will update the functional specification to include LONG VARCHAR and CLOB alongwith CHAR and VARCHAR. I am attaching version 2 of the functional spec based on these changes (
Mamta A. Satoor made changes - 13/Feb/07 08:39 AM
Fixed the typo in the url that pointed to writing functions to achieve collation. The url http://wiki.apache.org/db-derby/LanguageBasedOrdering had a comma at the end and hence it won't find the page.
Mamta A. Satoor made changes - 15/Feb/07 09:58 PM
Updated the functional spec with following changes and attached it as
1)The valid values for collation attributes are UCS_BASIC and TERRITORY_BASED 2)Updated the spec to say that collation attribute will be saved as a database property.
Mamta A. Satoor made changes - 24/Feb/07 03:28 PM
Do we envision using this feature to provide case-insensitive searching support? For example,
if I had a column in my database named "STATE_NAME", and I wanted to set things up so that "WHERE STATE_NAME='New York'" and "WHERE STATE_NAME='NEW YORK'" both selected the same set of rows, could I use this collation feature to do that? The code work that will go in for this feature can be extended by someone with the itch to provide case-insensitive search support. Currently, with this feature, the Collator will be determined by Derby code from the territory and it will be passed to Collator sensitive CHAR classes so that can do collation based on that. Somebody would need to do the coding work to get the correct Collator for case-insensitive search and pass that it to Collator sensitive CHAR classes.
So, in short, at the end of this feature, an end user can't simply configure Derby to do case-insensitive search rather than territory based search. Some Derby coding will be required to enable case-insensitive search.
Daniel John Debrunner made changes - 29/Mar/07 11:38 PM
Mamta A. Satoor made changes - 18/May/07 07:27 AM
Updated the functional spec to cover
Mamta A. Satoor made changes - 18/May/07 07:40 AM
It would be fantastic if we could set the collation on a per-field basis, as in:
alter table foo modify bar varchar(10) COLLATE latin1_bin ; This way, wouldn't we be able to do a case-sensitive search for a field, even if we set a case-INsensitive collation for the database? Let me know if I can help. I agree that it will be pretty useful to provide collation on a per column basis. We probably should enter a new Jira entry for it for someone with itch to start looking at it.
Mike Matrigali made changes - 01/Jun/07 08:22 PM
Mike Matrigali made changes - 01/Jun/07 08:23 PM
Mike Matrigali made changes - 01/Jun/07 08:23 PM
Mike Matrigali made changes - 01/Jun/07 08:24 PM
Mike Matrigali made changes - 01/Jun/07 08:25 PM
Mike Matrigali made changes - 01/Jun/07 08:25 PM
Daniel John Debrunner made changes - 15/Jun/07 06:16 PM
Should there be a release note for this issue?
Kathey Marsden made changes - 22/Jun/07 10:48 PM
Kathey Marsden made changes - 22/Jun/07 10:52 PM
Daniel John Debrunner made changes - 27/Jun/07 02:18 PM
Mamta A. Satoor made changes - 06/Jul/07 09:06 AM
Mamta A. Satoor made changes - 06/Jul/07 09:55 AM
Myrna van Lunteren made changes - 07/Jul/07 07:42 PM
Mike Matrigali made changes - 11/Jul/07 10:32 PM
Mike Matrigali made changes - 31/Jul/07 08:33 PM
Can this issue be resolved?
Kathey, the subtask
I have created DERBY-3651 (which is a jira entry for the subtask
Mamta A. Satoor made changes - 29/Apr/08 06:20 PM
Mamta A. Satoor made changes - 29/Apr/08 06:20 PM
Rick Hillegas made changes - 26/Aug/08 06:11 PM
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
DERBY-1478is a general request for built-in language based ordering and like processing for DERBY.DERBY-533is one possible implementation.