
Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: SQL

    Description

      Implement the built-in Spark string functions startsWith and endsWith using StringSearch, an efficient ICU service for string matching. Refer to the latest unit tests in CollationSuite to see how these functions are used in Spark SQL, and feel free to experiment with the existing functions in a Spark SQL editor of your choice to learn more about how they work.

       

      Currently, these two functions support all collation types:

      1. binary collations (UCS_BASIC, UNICODE): a special case; these collations work using the existing string comparison functions, i.e. contains(), startsWith(), endsWith()
      2. special lowercase non-binary collation (UCS_BASIC_LCASE): a special case; this collation works by using lower() to convert both strings to lowercase and then applying the functions above
      3. other non-binary collations (UNICODE_CI; special collations for various languages with case and accent sensitivity): these collations usually require special handling, which can sometimes be complex
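
      The first two cases in the list above can be sketched in plain Java as follows. This is only an illustration of the dispatch idea, not Spark's actual code: the CollationKind enum and the CollationDispatchSketch class are hypothetical names, java.lang.String stands in for UTF8String, and the third case is deliberately left unimplemented because it needs collation-aware matching.

```java
import java.util.Locale;

// Hypothetical category names for illustration; not part of Spark.
enum CollationKind { BINARY, LOWERCASE, OTHER_NON_BINARY }

class CollationDispatchSketch {

    // Prefix check that picks a strategy based on the collation category.
    static boolean startsWith(String text, String prefix, CollationKind kind) {
        switch (kind) {
            case BINARY:
                // Case 1: reuse the existing binary comparison as-is.
                return text.startsWith(prefix);
            case LOWERCASE:
                // Case 2: lowercase both sides, then reuse the binary check.
                return text.toLowerCase(Locale.ROOT)
                           .startsWith(prefix.toLowerCase(Locale.ROOT));
            default:
                // Case 3: lowercasing is not enough (accents, language rules).
                throw new UnsupportedOperationException(
                    "needs collation-aware search");
        }
    }

    // Suffix check with the same three-way dispatch.
    static boolean endsWith(String text, String suffix, CollationKind kind) {
        switch (kind) {
            case BINARY:
                return text.endsWith(suffix);
            case LOWERCASE:
                return text.toLowerCase(Locale.ROOT)
                           .endsWith(suffix.toLowerCase(Locale.ROOT));
            default:
                throw new UnsupportedOperationException(
                    "needs collation-aware search");
        }
    }
}
```

      In this sketch, a call with CollationKind.LOWERCASE treats "SPARK" as a prefix of "Spark SQL", while the same call with CollationKind.BINARY does not.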

       

      To understand what changes were introduced in order to enable collation support for these functions, take a look at the Spark PRs and Jira tickets below:

      • https://github.com/apache/spark/pull/45216 this PR enables:
        • partial collation support for contains (skipping the 3rd type of collations shown above)
        • complete collation support for startsWith, endsWith (using a special matchAt implementation directly in UTF8String)

       

      Focusing on the 3rd type of collations as shown above, the goal for this Jira ticket is to re-implement the startsWith and endsWith functions so that they use StringSearch instead (following the general logic in the second PR). As for the current test cases in CollationSuite, they should already mostly cover the expected behaviour of startsWith and endsWith for the 3rd type of collations.
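
      The search-based idea can be sketched as follows: find the first (or last) collation-equal match of the pattern in the text, then check that it starts at position 0 (or ends at the end of the text). This sketch is an assumption-laden stand-in, not the implementation requested by this ticket: it uses the JDK's java.text.Collator and a naive quadratic scan instead of ICU's StringSearch, it works on java.lang.String rather than UTF8String, and the CollationSearchSketch class and its methods are hypothetical names.

```java
import java.text.Collator;

// Hypothetical helper for illustration; not part of Spark or ICU.
class CollationSearchSketch {

    // Returns {start, end} of the match with the smallest start index that the
    // collator considers equal to the pattern, or null if there is none.
    static int[] firstMatch(String text, String pattern, Collator collator) {
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                if (collator.compare(text.substring(start, end), pattern) == 0) {
                    return new int[] {start, end};
                }
            }
        }
        return null;
    }

    // Returns {start, end} of the match with the largest end index, or null.
    static int[] lastMatch(String text, String pattern, Collator collator) {
        for (int end = text.length(); end > 0; end--) {
            for (int start = end - 1; start >= 0; start--) {
                if (collator.compare(text.substring(start, end), pattern) == 0) {
                    return new int[] {start, end};
                }
            }
        }
        return null;
    }

    // The text starts with the pattern if the first match begins at index 0.
    static boolean startsWith(String text, String pattern, Collator collator) {
        if (pattern.isEmpty()) return true;
        int[] match = firstMatch(text, pattern, collator);
        return match != null && match[0] == 0;
    }

    // The text ends with the pattern if the last match stops at the text end.
    static boolean endsWith(String text, String pattern, Collator collator) {
        if (pattern.isEmpty()) return true;
        int[] match = lastMatch(text, pattern, collator);
        return match != null && match[1] == text.length();
    }
}
```

      With a collator set to Collator.SECONDARY strength (case differences ignored), startsWith("SparkSQL", "SPARK", collator) is expected to hold, whereas at the default tertiary strength it is not. A real implementation would rely on StringSearch to locate matches efficiently instead of comparing every substring.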

       

      Read more about StringSearch using the ICU user guide and ICU docs. Also, refer to the Unicode Technical Standard for string searching and collation.

            People

              Assignee: Stevo Mitric (stevomitric)
              Reporter: Uroš Bojanić (uros-db)