Details
- Sub-task
- Status: Resolved
- Major
- Resolution: Fixed
- 4.0.0
Description
Implement the startsWith and endsWith built-in Spark string functions using StringSearch, an efficient ICU service for string matching. Refer to the latest unit tests in CollationSuite to see how these functions are used in Spark SQL, and feel free to experiment with the existing functions in a Spark SQL editor of your choice to learn more about how they work.
Currently, these 2 functions support all collation types:
- binary collations (UCS_BASIC, UNICODE): a special case; these collation types work using the existing string comparison functions, i.e. contains(), startsWith(), endsWith()
- the special lowercase non-binary collation (UCS_BASIC_LCASE): a special case; this collation type works by using lower() to convert both strings to lowercase and then applying the functions above
- other non-binary collations (UNICODE_CI; special collations for various languages with case and accent sensitivity): these collation types usually require special handling, which can sometimes be complex
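The three categories above can be illustrated with a minimal sketch. It is not Spark code: the class and method names are hypothetical, and the third category uses the JDK's java.text.Collator with a naive slice-by-slice comparison as a stand-in for ICU StringSearch, only to show why these collations need more than a plain byte or lowercase comparison.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper for illustration only; none of these names exist in Spark.
final class CollationPrefixSketch {

    // Type 1 (binary collations): plain code-unit comparison.
    static boolean startsWithBinary(String target, String prefix) {
        return target.startsWith(prefix);
    }

    // Type 2 (lowercase collation): lowercase both sides, then compare as in type 1.
    static boolean startsWithLowercase(String target, String prefix) {
        return target.toLowerCase(Locale.ROOT)
                     .startsWith(prefix.toLowerCase(Locale.ROOT));
    }

    // Type 3 (other non-binary collations): a match need not have the same
    // length as the pattern, so try every leading slice of the target and ask
    // the collator whether it is equal to the prefix. This naive loop stands in
    // for ICU StringSearch, which performs the search far more efficiently.
    static boolean startsWithCollated(String target, String prefix, Collator collator) {
        for (int end = 0; end <= target.length(); end++) {
            if (collator.compare(target.substring(0, end), prefix) == 0) {
                return true;
            }
        }
        return false;
    }

    // Mirror image of startsWithCollated: try every trailing slice of the target.
    static boolean endsWithCollated(String target, String suffix, Collator collator) {
        for (int start = target.length(); start >= 0; start--) {
            if (collator.compare(target.substring(start), suffix) == 0) {
                return true;
            }
        }
        return false;
    }

    // Assumption: SECONDARY strength is used here to approximate a
    // case-insensitive, accent-sensitive collation.
    static Collator caseInsensitive() {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.SECONDARY);
        return collator;
    }
}
```

With this sketch, CollationPrefixSketch.startsWithBinary("Hello World", "hello") is false, while both startsWithLowercase and startsWithCollated (with the caseInsensitive() collator) accept the same inputs.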
To understand what changes were introduced in order to enable collation support for these functions, take a look at the Spark PRs and Jira tickets below:
- https://github.com/apache/spark/pull/45216 this PR enables:
  - partial collation support for contains (skipping the 3rd type of collations shown above)
  - complete collation support for startsWith and endsWith (using a special matchAt implementation directly in UTF8String)
- https://github.com/apache/spark/pull/45382 this PR enables:
  - complete collation support for contains (using StringSearch); startsWith and endsWith should now use this approach as well
Focusing on the 3rd type of collations as shown above, the goal for this Jira ticket is to re-implement the startsWith and endsWith functions so that they use StringSearch instead (following the general logic in the second PR). As for the current test cases in CollationSuite, they should already mostly cover the expected behaviour of startsWith and endsWith for the 3rd type of collations.
Read more about StringSearch using the ICU user guide and ICU docs. Also, refer to the Unicode Technical Standard for string searching and collation.