Derby / DERBY-2699

Performance of LIKE in territory-based collation databases may be improved by changing the way collation elements are calculated.

    Details

    • Urgency:
      Normal

      Description

      WorkHorseForCollatorDatatypes.java has a method, getCollationElementsForString(), which is currently
      called when processing LIKE clauses in databases that were created with territory-based collation. This is
      not an issue in pre-10.3 databases or in databases created with 10.3 or later that use the default collation.
      getCollationElementsForString() gets the collation elements for the entire value of the String held by
      the datatype that uses the class.

      Take the case where the pattern is 'A%' and the value of the datatype is 'BXXXXXXXXXXXXXXXXXXXXXXX'.
      It would be better to get the collation elements one character of the String value at a time,
      which avoids computing collation elements for the entire string when they are not really needed.
      One could imagine this having a huge performance impact when running LIKE against a long CLOB where
      the LIKE pattern has a leading fixed-length part to match.

      Comments on this from Dan and Dag can be found in DERBY-2416.
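
The idea in the description can be sketched with the standard java.text API. This is only an illustration under stated assumptions, not Derby's implementation: the class name LikePrefixSketch and the method startsWithCollated are hypothetical, and the sketch handles only a fixed prefix followed by '%'. It pulls collation elements from the value one at a time and stops at the first mismatch instead of materializing the elements of the whole string.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical sketch: match the fixed prefix of a LIKE pattern such as 'A%'
// by reading collation elements lazily from the value.
public class LikePrefixSketch {

    // Returns true if the collation elements of 'prefix' are a prefix of the
    // collation elements of 'value'. Elements of 'value' are requested only
    // as long as the prefix keeps matching.
    static boolean startsWithCollated(RuleBasedCollator collator,
                                      String value, String prefix) {
        CollationElementIterator v = collator.getCollationElementIterator(value);
        CollationElementIterator p = collator.getCollationElementIterator(prefix);
        int prefixElement;
        while ((prefixElement = p.next()) != CollationElementIterator.NULLORDER) {
            // If 'value' is exhausted, v.next() returns NULLORDER, which never
            // equals a real prefix element, so this also covers short values.
            if (v.next() != prefixElement) {
                return false; // first mismatch: the rest of 'value' is never read
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumption: the JDK returns a RuleBasedCollator for this locale,
        // which is the case for the standard java.text implementation.
        RuleBasedCollator collator =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        System.out.println(
                startsWithCollated(collator, "BXXXXXXXXXXXXXXXXXXXXXXX", "A"));
        System.out.println(startsWithCollated(collator, "AXXXX", "A"));
    }
}
```

With the pattern 'A%' and the value from the description, the loop asks for a single element of the value before rejecting it, regardless of how long the value is.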

      1. d2699-1a.diff
        12 kB
        Knut Anders Hatlen

          Activity

          ASF subversion and git services added a comment -

          Commit 1510373 from Mamta A. Satoor in branch 'code/branches/10.8'
          [ https://svn.apache.org/r1510373 ]

          DERBY-2699(performance of like in territory based collation databases may be improved by changing way collation elements are calculated.)

          Backporting to 10.8. Fix contributed by Knut Anders Hatlen

          ASF subversion and git services added a comment -

          Commit 1510366 from Mamta A. Satoor in branch 'code/branches/10.9'
          [ https://svn.apache.org/r1510366 ]

          DERBY-2699(performance of like in territory based collation databases may be improved by changing way collation elements are calculated.)

          Backporting to 10.9. Fix contributed by Knut Anders Hatlen

          Mamta A. Satoor added a comment -

          Working on backporting this jira

          Knut Anders Hatlen added a comment -

          Committed revision 1428305.

          I'm resolving the issue now since we no longer retrieve all collation elements.

          Knut Anders Hatlen added a comment -

          All the regression tests ran cleanly.

          Knut Anders Hatlen added a comment -

          DERBY-3136 improved the LIKE implementation along the lines suggested in this issue, so now WorkHorseForCollatorDatatypes.getCollationElementsForString() is only used for checking that the ESCAPE clause of a LIKE ... ESCAPE ... expression does not contain more than a single collation element (for example to disallow 'ß' in an ESCAPE clause, as it has two collation elements).

          Since it's only used for ESCAPE clauses now, and they are typically just a single character, the performance benefits are probably not that big anymore. But we can simplify how the collation elements are calculated now that we only need to check if it's a single element. For example, there is no need to have an intermediate int[] representation of the collation elements.

          Attached is a patch that removes the getCollationElementsForString() and getCountOfCollationElements() methods from WorkHorseForCollatorDatatypes, CollationElementsInterface, and all classes that implement CollationElementsInterface. Those methods are replaced by a simpler hasSingleCollationElement() method.

          This shrinks the source files by approximately 100 lines in addition to reducing the number of objects allocated when evaluating LIKE ... ESCAPE with territory based collation, which might (perhaps) slightly improve the performance.

          I'm running the full regression test suite on the patch now.
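
A check of the kind described above can be sketched as follows. This is an illustration only, not the attached patch: the method name hasSingleCollationElement() comes from the comment, but the signature, the wrapper class EscapeCheckSketch, and the use of java.text.RuleBasedCollator directly are assumptions. It shows why no intermediate int[] is needed: two calls to next() are enough to decide.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical sketch of a single-collation-element check for an ESCAPE string.
public class EscapeCheckSketch {

    // True if 's' yields exactly one collation element under 'collator':
    // the first next() must return an element and the second must report
    // the end of the iteration. No array of elements is built.
    static boolean hasSingleCollationElement(RuleBasedCollator collator, String s) {
        CollationElementIterator it = collator.getCollationElementIterator(s);
        return it.next() != CollationElementIterator.NULLORDER
                && it.next() == CollationElementIterator.NULLORDER;
    }

    public static void main(String[] args) {
        // Assumption: the JDK returns a RuleBasedCollator for this locale.
        RuleBasedCollator collator =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        System.out.println(hasSingleCollationElement(collator, "a"));
        System.out.println(hasSingleCollationElement(collator, "ab"));
    }
}
```

Whether a particular character such as 'ß' expands to more than one element depends on the collator's rules for the chosen locale, so the sketch makes no claim about specific characters beyond what the comment states.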

          Daniel John Debrunner added a comment -

          I think the approach of getting collation elements as needed would have a large effect on all string comparisons.

          I created a scale 4 order entry database both with and without territory-based collation. Looking only at the load, collation affects just 'index.sql', which creates an index that includes the customer's last name. With UCS_BASIC collation the index was created in about 2.5 seconds; with TERRITORY_BASED collation the time was over 11 seconds.

          I don't think the collation overhead should be that high. I would expect maybe a 10-20% overhead, not around 450%.


            People

            • Assignee: Mamta A. Satoor
            • Reporter: Mike Matrigali