Derby / DERBY-3854

Implement LIKE transformations and optimizations for databases using territory-based collations

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.3.3.0, 10.4.1.3, 10.5.1.1
    • Fix Version/s: None
    • Component/s: SQL
    • Urgency: Normal

      Description

      The LIKE transformations and optimizations are disabled when using a database with a territory-based collation. See the following email thread: http://www.nabble.com/territory-based-collations-and-optimizations-for-the-LIKE-operator-td19111725.html#a19111725 That thread, in turn, refers to DERBY-1478. It would be nice if we did not have to perform full table scans for LIKE queries in databases with territory-based collations.


          Activity

          Knut Anders Hatlen added a comment -

          >> (1) all the characters before the wildcard map to exactly one
          >> collation element,
          >
          > Need to add 'each' at the end, right, just to be clear?

          Right.

          > Though I may be having trouble understanding what this is trying to
          > say exactly. All single characters will map to a single collation
          > element by themselves, so I think it's really trying to say that
          > no sequence of characters in the prefix combines into a single
          > collation element?

          Yes, I think I messed up the terminology here. Actually, a single
          character can map to a sequence of collation elements, for instance
          the character 'ä' maps to two collation elements in the German
          locale (mentioned in the class javadoc for
          CollationElementIterator). But I
          think you're right that we only need special handling for the
          multi-character sequences that combine into a different sequence of
          collation elements than what you get by concatenating the collation
          elements for each single character in the sequence.

          So to use the terminology from RuleBasedCollator, the problematic
          sequences are a subset of those text-arguments whose length is greater
          than 1.
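
          (For illustration, a minimal standalone sketch, not Derby code,
          that prints the collation elements of a string so expansions and
          contractions become visible. The locales and test strings are
          just the examples from this discussion:)

              import java.text.CollationElementIterator;
              import java.text.Collator;
              import java.text.RuleBasedCollator;
              import java.util.Locale;

              public class PrintCollationElements {
                  static void print(Locale locale, String s) {
                      // Collator.getInstance returns a RuleBasedCollator
                      // for all JDK-supplied locales.
                      RuleBasedCollator collator =
                              (RuleBasedCollator) Collator.getInstance(locale);
                      CollationElementIterator it =
                              collator.getCollationElementIterator(s);
                      StringBuilder sb =
                              new StringBuilder(locale + " \"" + s + "\":");
                      for (int e = it.next();
                              e != CollationElementIterator.NULLORDER;
                              e = it.next()) {
                          sb.append(" 0x").append(Integer.toHexString(e));
                      }
                      System.out.println(sb);
                  }

                  public static void main(String[] args) {
                      print(Locale.GERMANY, "ä");          // expansion: two elements
                      print(new Locale("no", "NO"), "aa"); // contraction: one element
                  }
              }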

          Take for instance this fragment of the return value from getRules()
          with German (de_DE) collation:

          r,R<s, S & SS,ß<t,T& TH, Þ &TH, þ <u,U

          In the rules above, we have two multi-character text-arguments: SS and
          TH. None of them are problematic, however, since the collation
          elements for 'SS' are the same as the collation elements for 'S'
          concatenated with the collation elements for another 'S', and the
          collation elements for 'TH' are the same as for 'T' followed by 'H'.
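
          (The fragment above is a substring of the full rule string; for
          reference, it can be dumped like this:)

              import java.text.Collator;
              import java.text.RuleBasedCollator;
              import java.util.Locale;

              public class DumpRules {
                  public static void main(String[] args) {
                      RuleBasedCollator collator = (RuleBasedCollator)
                              Collator.getInstance(new Locale("de", "DE"));
                      // Prints the de_DE rule string; the fragment quoted
                      // above appears somewhere inside it.
                      System.out.println(collator.getRules());
                  }
              }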

          Here's another fragment from Norwegian (no_NO) collation:

          õ, Õ < å, Å, aa , aA , Aa , AA & V < w, W

          All of the two-character sequences above (aa, aA, Aa, AA) are
          problematic because each of them results in a single collation element
          and therefore doesn't match the collation elements when each character
          is considered on its own.
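
          (A sketch, not Derby code, of how one could test mechanically
          whether a sequence is problematic in this sense: compare the
          collation elements of the whole sequence against the
          concatenation of the elements of its individual characters:)

              import java.text.CollationElementIterator;
              import java.text.Collator;
              import java.text.RuleBasedCollator;
              import java.util.ArrayList;
              import java.util.List;
              import java.util.Locale;

              public class ProblematicSequences {
                  static List<Integer> elements(RuleBasedCollator c, String s) {
                      List<Integer> out = new ArrayList<>();
                      CollationElementIterator it =
                              c.getCollationElementIterator(s);
                      for (int e = it.next();
                              e != CollationElementIterator.NULLORDER;
                              e = it.next()) {
                          out.add(e);
                      }
                      return out;
                  }

                  // A sequence is problematic if its collation elements
                  // differ from the concatenated elements of its characters.
                  static boolean isProblematic(RuleBasedCollator c, String seq) {
                      List<Integer> concatenated = new ArrayList<>();
                      for (int i = 0; i < seq.length(); i++) {
                          concatenated.addAll(elements(c, seq.substring(i, i + 1)));
                      }
                      return !elements(c, seq).equals(concatenated);
                  }

                  public static void main(String[] args) {
                      RuleBasedCollator de = (RuleBasedCollator)
                              Collator.getInstance(new Locale("de", "DE"));
                      RuleBasedCollator no = (RuleBasedCollator)
                              Collator.getInstance(new Locale("no", "NO"));
                      System.out.println(isProblematic(de, "SS")); // expected: false
                      System.out.println(isProblematic(no, "aa")); // expected: true
                  }
              }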

          >> col < 'abcde\uFFFF'
          >
          > \uFFFF is incorrect since it is not a valid character and the
          > numeric codepoint of the character is not what the index is using
          > for ordering. You really need
          >
          > col < 'abcdf'
          >
          > where 'f' is the character determined from a collation element value
          > that is greater than the one for 'e'

          Right. I don't think it's a problem that \uFFFF is an invalid
          character, since that's the character that's used by the current LIKE
          optimization. (Perhaps it's because it is an invalid character that
          we can get away with it when we have collation=UCS_BASIC? I would have
          expected that we needed an infinitely long string containing \uFFFF
          characters, not just a single character, to get a proper upper limit.)
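
          (The code-point reasoning can be checked directly: UCS_BASIC
          compares by Unicode code point, which for these strings matches
          String.compareTo, and any real string extending the prefix stays
          below the single-character pad because no valid character sorts
          at or above U+FFFF:)

              public class PadDemo {
                  public static void main(String[] args) {
                      String bound = "abcde" + '\uffff';
                      System.out.println("abcdez".compareTo(bound) < 0);       // true
                      System.out.println("abcde\ufffe!".compareTo(bound) < 0); // true
                      // Only a value that itself contained U+FFFF right after
                      // the prefix could reach the bound, and valid text never
                      // contains that noncharacter.
                  }
              }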

          But you're right that if the RuleBasedCollator has explicitly defined
          an ordering for \uFFFF we cannot use it as an upper limit. I'm not
          sure what the best way to find that upper limit is. For the characters
          mentioned in the rules we get from getRules(), it should be trivial,
          as we can get the full ordering when we parse the rules. But a large
          number of the characters aren't mentioned in the rules. For instance,
          the rules for de_DE don't contain any of the characters with accents,
          as their ordering is determined by the decomposition mode of the
          collator, nor do they contain letters from non-Latin alphabets.
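
          (Purely as an illustration, and ignoring contractions and
          expansions, which is exactly what makes the real problem hard,
          one brute-force way to find such an upper-limit character
          without parsing the rules is to probe the collator directly.
          The method name here is made up:)

              import java.text.Collator;
              import java.util.Locale;

              public class UpperBoundProbe {
                  // Replace the last character of the prefix with some
                  // character the collator orders strictly after it, so the
                  // result can serve as an exclusive upper bound. Returns
                  // null if no such character exists.
                  static String upperBound(Collator collator, String prefix) {
                      String last = prefix.substring(prefix.length() - 1);
                      for (char ch = '\ufffd'; ch > 0; ch--) {
                          if (collator.compare(String.valueOf(ch), last) > 0) {
                              return prefix.substring(0, prefix.length() - 1) + ch;
                          }
                      }
                      return null;
                  }

                  public static void main(String[] args) {
                      Collator de = Collator.getInstance(new Locale("de", "DE"));
                      System.out.println(upperBound(de, "abcde"));
                  }
              }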

          Daniel John Debrunner added a comment -

          > (1) all the characters before the wildcard map to exactly one collation element,

          Need to add 'each' at the end, right, just to be clear?

          Though I may be having trouble understanding what this is trying to say exactly. All single characters will map to a single collation element by themselves, so I think it's really trying to say that no sequence of characters in the prefix combines into a single collation element?

          > col < 'abcde\uFFFF'

          \uFFFF is incorrect since it is not a valid character and the numeric codepoint of the character is not what the index is using for ordering. You really need

          col < 'abcdf'

          where 'f' is the character determined from a collation element value that is greater than the one for 'e'

          Rick Hillegas added a comment -

          Hi Knut,

          This looks like a promising approach. It certainly seems that we ought to be able to find a longest, well-behaved leading prefix of the LIKE constant--and then build bounds from that prefix. Your rules look good to me. That is, I haven't been able to come up with a problem case. That's a bit shy of proof, of course. And I am a bit muddled by the interaction of the Collator levers, strength and decomposition.

          Knut Anders Hatlen added a comment -

          Don't know if it'll work or if I use the correct terminology, but here
          are some thoughts on how we could optimize LIKE when using collation:

          The collators we use are always of type RuleBasedCollator, from which
          we can retrieve the rules by calling getRules(). By looking at the
          rules, we may be able to perform some optimizations. One example:

          col LIKE 'abcde%'

          If we know, by looking at the rules, that (1) all the characters
          before the wildcard map to exactly one collation element, and (2) that
          none of them are ignorable, and (3) that the last characters up to the
          wildcard cannot be the start of a sequence of characters mapping to
          one collation element, we can perform the same optimization as we can
          when we're not using collation. That is, we can add the predicate
          'abcde' <= col < 'abcde\uFFFF' to limit the scan.
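
          (As a sketch, assuming a single trailing '%' and no '_'
          wildcards, the extra predicate pair could be derived like this;
          class and method names are made up for illustration:)

              import java.util.Arrays;

              public class LikeBounds {
                  // For col LIKE '<prefix>%' where the prefix satisfies
                  // conditions (1)-(3), the scan is limited to
                  // 'prefix' <= col < 'prefix' + '\uFFFF'.
                  static String[] scanBounds(String pattern) {
                      String prefix = pattern.substring(0, pattern.indexOf('%'));
                      return new String[] { prefix, prefix + '\uffff' };
                  }

                  public static void main(String[] args) {
                      System.out.println(Arrays.toString(scanBounds("abcde%")));
                  }
              }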

          Whereas databases using collation=UCS_BASIC can always use the prefix
          up to the first wildcard character for optimizations like this, I
          think databases with territory-based collation can only use the prefix
          up to the first occurrence of one of

          a) A wildcard character

          b) An ignorable character

          c) A character that maps to a sequence of collation elements

          d) A sequence of characters mapping to one collation element

          e) A sequence of characters which could be the start of a sequence
          that maps to one collation element, directly followed by a wildcard

          So with territory=no_NO, where a < z < aa, we'd optimize the
          expression

          col LIKE 'data%'

          by adding the predicate 'dat' <= col < 'dat\uFFFF' because the 'a'
          right before the '%' could be the start of the sequence 'aa' which
          maps to a single collation element (rule (e) above). (Or if we want to
          be really sophisticated, I think we can use the predicate 'data' <=
          col < 'dataa\uFFFF' in this case, but that would require a more
          thorough analysis of the collator's rules.)
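
          (Putting rules (a) through (e) together, a rough standalone
          sketch, not Derby code; the contraction, ignorable, and
          expansion sets are assumed inputs that a real implementation
          would have to derive from getRules():)

              import java.util.Set;

              public class LikePrefix {
                  // Longest prefix of a LIKE pattern that is safe for
                  // building scan bounds under a territory-based collation.
                  static String usablePrefix(String pattern,
                                             Set<String> contractions,
                                             Set<Character> ignorables,
                                             Set<Character> expansions) {
                      int i = 0;
                      while (i < pattern.length()) {
                          char ch = pattern.charAt(i);
                          if (ch == '%' || ch == '_') break;           // rule (a)
                          if (ignorables.contains(ch)) break;          // rule (b)
                          if (expansions.contains(ch)) break;          // rule (c)
                          if (contractionAt(pattern, i, contractions)) // rule (d)
                              break;
                          i++;
                      }
                      // Rule (e): back off while the last prefix character
                      // could begin a contraction (the final 'a' of 'data%'
                      // in no_NO). Applied conservatively regardless of why
                      // the scan stopped; a full implementation would also
                      // consider longer tails.
                      while (i > 0 && startsContraction(
                              pattern.substring(i - 1, i), contractions)) {
                          i--;
                      }
                      return pattern.substring(0, i);
                  }

                  // Does some known contraction occur at position i of s?
                  static boolean contractionAt(String s, int i,
                                               Set<String> contractions) {
                      for (String c : contractions) {
                          if (s.startsWith(c, i)) return true;
                      }
                      return false;
                  }

                  // Is tail a proper prefix of some known contraction?
                  static boolean startsContraction(String tail,
                                                   Set<String> contractions) {
                      for (String c : contractions) {
                          if (c.length() > tail.length() && c.startsWith(tail)) {
                              return true;
                          }
                      }
                      return false;
                  }

                  public static void main(String[] args) {
                      // With the no_NO contractions, 'data%' yields "dat".
                      Set<String> no = Set.of("aa", "aA", "Aa", "AA");
                      System.out.println(usablePrefix("data%", no,
                              Set.of(), Set.of()));
                  }
              }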


            People

            • Assignee: Unassigned
            • Reporter: Rick Hillegas
            • Votes: 0
            • Watchers: 0
