[DERBY-6607] Derby is using territory/collation for equality, not just ordering (incorrectly?) - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 10.10.2.0
Fix Version/s: None
Component/s: Localization
Labels:
None

Bug behavior facts:

Deviation from standard, Seen in production

Description

We have a database where we wish case-insensitivity, and therefore it was created with collation=TERRITORY_BASED:PRIMARY. We have customers in both the United States (en_US) and in Japan (ja_JP).

We have an issue in Japan. Japanese has three character sets: hiragana, katakana, and kanji. Hiragana is a phonetic alphabet with 46 letters. Katakana is an identical phonetic alphabet with 46 letters, written using different character forms, and used for foreign words (words adopted from other languages into Japanese).

Here is the word 'cake' written in katakana: ケーキ (ke- ki)
Here is the word 'cake' written in hiragana: けーき (ke- ki)

In terms of collation (ordering), Japanese consider these to be equal. So, in the following Java code, the call to 'compare()' would return 0:

Collator collator = Collator.getInstance(Locale.JAPAN);
collator.setStrength(Collator.PRIMARY);
return collator.compare("ケーキ", "けーき");

And therein lies the issue. With respect to ordering they are indeed equivalent, however Japanese would consider them district (non-equivalent) values.

When a table is declared with a UNIQUE constraint on a column, or a PRIMARY KEY column, if 'ケーキ' exists in the table, Derby will throw a unique constraint violation upon an attempt to insert 'けーき'.

We need collation=TERRITORY_BASED:PRIMARY or TERRITORY_BASED:SECONDARY for case-insensitivity and at the same time need these values to be treated as unique. It is as if String.equals() should be used if the lvalue or rvalue of an = operator is Japanese, but should use Collator.equals() if both the lvalue and rvalue are "ascii-betical". The same for constraint checking.

Is it "correct" that Derby use the collation when determining value equivalency vs. ordering equivalency?

At the same time, I understand that this is tricky. Japanese has no "upper-case" and "lower-case" for hiragana, katakana, or kanji, however they do use "romanji" (roman characters) which are essentially ASCII, which is case-sensitive. Collation is merely used for ordering. So when TERRITORY_BASED:PRIMARY/SECONDARY is used, for Japanese, 'cat' and 'CAT' would be equivalent but 'ケーキ' and 'けーき' would not be. Unfortunately, there is only one Collator and it will identify both of these as equivalent.

Taking the example further, imagine a database with collation=TERRITORY_BASED:SECONDARY, and tags table without a unique constraint, but containing the following values:

Tag
-----------------------
Cat
cat
ケーキ
けーき

The following SQL should delete both cats:

DELETE FROM tags WHERE tag='cAT'

But from the Japanese perspective, the following code would erroneously delete both cakes:

DELETE FROM tags WHERE tag='ケーキ'

They consider the two expressions of the word cake distinct, but consider the two cats as equivalent. The Collator considers them all equivalent. It is as if String.equals() should be used if the lvalue or rvalue of an = operator is Japanese, and use Collator.equals() if the lvalue and rvalue are "ascii-betical".

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Brett Wooldridge

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 11/Jun/14 07:34

Updated:: 13/Jun/14 08:40

Resolved:: 13/Jun/14 00:30