Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Version: 4.0.0
Description
Enable collation support for the SubstringIndex built-in string function in Spark. First confirm the expected behaviour of this function when given collated strings, then move on to implementation and testing. One way to go about this is to use StringSearch, an efficient ICU service for string matching. Implement the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions to learn more about how they work. In addition, look into the possible use cases and implementations of similar functions in other open-source DBMSs, such as PostgreSQL.
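To make the expected behaviour concrete, here is a minimal sketch of substring_index semantics with a pluggable match rule. It is not Spark's implementation and does not use ICU: the class `SubstringIndexSketch`, the `Matcher` interface, and the `BINARY` and `IGNORE_CASE` constants are hypothetical names introduced for illustration, and the case-insensitive comparison is only a stand-in for a real collation. In particular, it assumes a match always has the same length as the delimiter, which does not hold for ICU collations in general (one reason StringSearch is suggested above).

```java
class SubstringIndexSketch {

    // Hypothetical stand-in for a collation: does `delim` match `text` at offset `pos`?
    interface Matcher {
        boolean matchesAt(String text, int pos, String delim);
    }

    // Code-unit equality, roughly what a binary collation would do.
    static final Matcher BINARY =
        (t, p, d) -> t.regionMatches(false, p, d, 0, d.length());

    // Case-insensitive equality, an assumed approximation of a lowercase collation.
    static final Matcher IGNORE_CASE =
        (t, p, d) -> t.regionMatches(true, p, d, 0, d.length());

    // Returns the part of `text` before the count-th match of `delim` (count > 0),
    // or the part after the |count|-th match counted from the right (count < 0).
    // Returns `text` unchanged when there are fewer than |count| matches.
    static String substringIndex(String text, String delim, int count, Matcher m) {
        if (count == 0 || delim.isEmpty()) {
            return "";
        }
        int last = text.length() - delim.length(); // last offset where delim can start
        if (count > 0) {
            for (int i = 0; i <= last; i++) {
                if (m.matchesAt(text, i, delim) && --count == 0) {
                    return text.substring(0, i);
                }
            }
        } else {
            for (int i = last; i >= 0; i--) {
                if (m.matchesAt(text, i, delim) && ++count == 0) {
                    return text.substring(i + delim.length());
                }
            }
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(substringIndex("www.apache.org", ".", 2, BINARY));  // www.apache
        System.out.println(substringIndex("www.apache.org", ".", -1, BINARY)); // org
        System.out.println(substringIndex("aXbxc", "x", 1, BINARY));           // aXb
        System.out.println(substringIndex("aXbxc", "x", 1, IGNORE_CASE));      // a
    }
}
```

The last two calls show why the match rule matters: the same input and delimiter yield different results once the uppercase `X` counts as an occurrence of `x`. Unit and E2E tests for the real function would need to pin down this kind of difference for each supported collation.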
The goal for this Jira ticket is to implement the SubstringIndex function so that it supports all collation types currently supported in Spark. To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the Spark PRs and Jira tickets for completed tasks in this parent (for example: Contains, StartsWith, EndsWith).
Read more about ICU Collation Concepts and the Collator class, as well as StringSearch, in the ICU user guide and ICU docs. Also refer to the Unicode Technical Standard for string searching and collation.