>> (1) all the characters before the wildcard map to exactly one
>> collation element,
> Need to add 'each' at the end, right, just to be clear?
> Though I may be having trouble understanding what this is trying to
> say exactly. All single characters will map to a single collation
> element by themselves, so I think it's really trying to say that
> none of the sequence of characters in the prefix combine into a
> single collation element.???
Yes, I think I messed up the terminology here. Actually, a single
character can map to a sequence of collation elements, for instance
the character 'ä' maps to two collation elements in the German locale
(mentioned in the class javadoc for CollationElementIterator). But I
think you're right that we only need special handling for the
multi-character sequences that combine into a different sequence of
collation elements than what you get by concatenating the collation
elements for each single character in the sequence.
So to use the terminology from RuleBasedCollator, the problematic
sequences are a subset of those text-arguments whose length is greater
than one.
Take for instance this fragment of the return value from getRules()
with German (de_DE) collation:
r,R<s, S & SS,ß<t,T& TH, Þ &TH, þ <u,U
In the rules above, we have two multi-character text-arguments: SS and
TH. Neither of them is problematic, however, since the collation
elements for 'SS' are the same as the collation elements for 'S'
concatenated with the collation elements for another 'S', and the
collation elements for 'TH' are the same as for 'T' followed by 'H'.
Here's another fragment from Norwegian (no_NO) collation:
õ, Õ < å, Å, aa , aA , Aa , AA & V < w, W
All of the two-character sequences above (aa, aA, Aa, AA) are
problematic because each of them results in a single collation element
and therefore doesn't match the collation elements when each character
is considered on its own.
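To make that check concrete, here is a minimal sketch of how one could
test whether a sequence is problematic in the sense above. It is not
taken from the Derby code. The class ContractionCheck, its methods and
the rule string are made up for illustration; the rule string defines
'ch' as a contraction in the same way the Norwegian rules define 'aa'.
It compares UTF-16 chars one by one and ignores surrogate pairs.

```java
import java.text.CollationElementIterator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ContractionCheck {

    // Made-up rules: 'ch' is a contraction that sorts between 'c' and 'd'.
    static RuleBasedCollator exampleCollator() {
        try {
            return new RuleBasedCollator("< a < b < c < ch < d < h");
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    // Collect all collation elements the collator produces for the text.
    static List<Integer> elements(RuleBasedCollator collator, String text) {
        List<Integer> result = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(text);
        for (int ce = it.next(); ce != CollationElementIterator.NULLORDER; ce = it.next()) {
            result.add(ce);
        }
        return result;
    }

    // True if the elements for the whole sequence differ from the
    // concatenation of the elements for each character on its own.
    static boolean isProblematic(RuleBasedCollator collator, String sequence) {
        List<Integer> perChar = new ArrayList<>();
        for (int i = 0; i < sequence.length(); i++) {
            perChar.addAll(elements(collator, sequence.substring(i, i + 1)));
        }
        return !perChar.equals(elements(collator, sequence));
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = exampleCollator();
        System.out.println("ch: " + isProblematic(collator, "ch"));
        System.out.println("ab: " + isProblematic(collator, "ab"));
    }
}
```

With these rules, 'ch' should be reported as problematic and 'ab'
should not. Running the same check against the collator for a real
locale would depend on the rules shipped with the JVM in use.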
>> col < 'abcde\uFFFF'
> \uFFFF is incorrect since it is not a valid character and the
> numeric codepoint of the character is not what the index is using
> for ordering. You really need
> col < 'abcdf'
> where 'f' is the character determined from a collation element value
> that is greater than the one for 'e'
Right. I don't think it's a problem that \uFFFF is an invalid
character, since that's the character that's used by the current LIKE
optimization. (Perhaps it's because it is an invalid character that we
can get away with it when we have collation=UCS_BASIC? I would have
expected that we needed an infinitely long string containing \uFFFF
characters, not just a single character, to get a proper upper limit.)
But you're right that if the RuleBasedCollator has explicitly defined
an ordering for \uFFFF we cannot use it as an upper limit. I'm not
sure what the best way to find that upper limit is. For the characters
mentioned in the rules we get from getRules(), it should be trivial,
as we can get the full ordering when we parse the rules. But a large
number of the characters aren't mentioned in the rules. For instance,
the rules for de_DE don't contain any of the characters with accents,
as their ordering is determined by the decomposition mode of the
collator, nor do they contain letters from non-Latin alphabets.
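For the easy case, where every character involved is mentioned in the
rules, the upper limit could be computed roughly as in the sketch
below. This is only an illustration under strong assumptions: the
ordering string ORDER stands in for whatever we would get from parsing
getRules(), all differences are primary, there are no contractions or
expansions, and no character outside ORDER occurs in the column. The
class LikeUpperBound and its methods are hypothetical names, not
existing Derby code.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Optional;

// Hypothetical helper, for illustration only.
final class LikeUpperBound {

    // Assumed primary ordering, as if extracted from the collation rules.
    static final String ORDER = "abcdef";

    // Build a collator whose rules are exactly "<a<b<c...", so that the
    // computed bound can be checked against Collator.compare().
    static RuleBasedCollator collatorFor(String order) {
        StringBuilder rules = new StringBuilder();
        for (char ch : order.toCharArray()) {
            rules.append('<').append(ch);
        }
        try {
            return new RuleBasedCollator(rules.toString());
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    // Replace the last character that has a successor in the ordering with
    // that successor and drop everything after it. Returns empty if every
    // character in the prefix is already the last one in the ordering.
    static Optional<String> upperBound(String prefix, String order) {
        for (int i = prefix.length() - 1; i >= 0; i--) {
            int pos = order.indexOf(prefix.charAt(i));
            if (pos < 0) {
                throw new IllegalArgumentException(
                        "character not in ordering: " + prefix.charAt(i));
            }
            if (pos + 1 < order.length()) {
                return Optional.of(prefix.substring(0, i) + order.charAt(pos + 1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = collatorFor(ORDER);
        String bound = upperBound("abcde", ORDER).orElseThrow();
        // Every string starting with the prefix should sort before the bound.
        System.out.println(bound + " " + (collator.compare("abcdeffff", bound) < 0));
    }
}
```

For the prefix 'abcde' this gives 'abcdf', which matches the example
in the quoted text. It does nothing for the characters that aren't
mentioned in the rules, so it only covers the trivial part of the
problem.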