  1. Spark
  2. SPARK-46837 String function support (parent)
  3. SPARK-47418

Optimize string predicate expressions for UTF8_BINARY_LCASE collation


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: SQL

    Description

      Implement the contains, startsWith, and endsWith built-in Spark string functions using the optimized lowercase comparison approach introduced by nikolamand-db in https://github.com/apache/spark/pull/45816. Refer to the latest design and code structure introduced by uros-db in https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation support is added to Spark SQL expressions. In addition, review the previous Jira tickets under the current parent to understand how StringPredicate expressions are currently used and tested in Spark.

      These tickets should help you understand what changes were introduced in order to enable collation support for these functions. Lastly, feel free to use your chosen Spark SQL Editor to play around with the existing functions and learn more about how they work.

       

      The goal of this Jira ticket is to improve the UTF8_BINARY_LCASE implementation of the contains, startsWith, and endsWith functions so that they use the optimized lowercase comparison approach (following the general logic in Nikola's PR), and to benchmark the results. As for testing, the existing unit tests and end-to-end tests should already fully cover the expected behaviour of StringPredicate expressions for all collation types. In other words, the objective of this ticket is only to improve the internal implementation, without introducing any user-facing changes to the Spark SQL API.
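
      To illustrate the general idea, here is a minimal, self-contained Java sketch of case-insensitive contains, startsWith, and endsWith that compares lowercased code points on the fly instead of allocating lowercase copies of both strings. It is not the code from the PR above and does not use Spark's UTF8String. The class name LcasePredicates and all of its methods are hypothetical, and per-code-point lowercasing is an assumption that can differ from full-string lowercasing for a few special characters.

```java
// Hypothetical sketch, not Spark's implementation: operates on java.lang.String
// and lowercases one code point at a time, so no lowercase copy is allocated.
final class LcasePredicates {

    // True if `pattern` matches `target` starting at char index `pos`,
    // comparing the lowercase form of each code point.
    static boolean matchAt(String target, String pattern, int pos) {
        int i = pos;
        int j = 0;
        while (j < pattern.length()) {
            if (i >= target.length()) {
                return false; // target ran out before the pattern did
            }
            int a = target.codePointAt(i);
            int b = pattern.codePointAt(j);
            if (Character.toLowerCase(a) != Character.toLowerCase(b)) {
                return false;
            }
            i += Character.charCount(a);
            j += Character.charCount(b);
        }
        return true;
    }

    static boolean startsWithLcase(String target, String pattern) {
        return matchAt(target, pattern, 0);
    }

    static boolean endsWithLcase(String target, String pattern) {
        int n = pattern.codePointCount(0, pattern.length());
        if (n > target.codePointCount(0, target.length())) {
            return false; // pattern has more code points than target
        }
        // Step back n code points from the end of target, then match forward.
        int start = target.offsetByCodePoints(target.length(), -n);
        return matchAt(target, pattern, start);
    }

    static boolean containsLcase(String target, String pattern) {
        int i = 0;
        while (true) {
            if (matchAt(target, pattern, i)) {
                return true;
            }
            if (i >= target.length()) {
                return false;
            }
            // Advance to the start of the next code point.
            i += Character.charCount(target.codePointAt(i));
        }
    }

    public static void main(String[] args) {
        System.out.println(containsLcase("Spark SQL", "sql"));
        System.out.println(startsWithLcase("Spark SQL", "SPARK"));
        System.out.println(endsWithLcase("Spark SQL", "sQl"));
    }
}
```

      In this sketch, startsWith and endsWith inspect only as many code points as the pattern contains, whereas lowercasing both inputs up front would process the whole target string before comparing anything. Whether that pays off in practice is the kind of question the benchmark requested in this ticket would answer.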

       

      Finally, feel free to refer to the Unicode Technical Standard for string searching and collation.


          People

            Assignee: uros-db Uroš Bojanić
            Reporter: uros-db Uroš Bojanić