SPARK-48937 (project: Spark; parent: SPARK-46837 String function support)

Fix collation support for the StringToMap expression (binary & lowercase collation only)

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: SQL

    Description

      Enable collation support for the StringToMap built-in string function in Spark (str_to_map). First confirm the expected behaviour of this function when given collated strings, then move on to implementation and testing. You will find this expression in the complexTypeCreator.scala file. However, this expression is currently implemented as a pass-through function, which is wrong because it does not provide appropriate collation awareness for non-default delimiters.

       

      Example 1.

      SELECT str_to_map('a:1,b:2,c:3', ',', ':');

      This query will give the correct result, regardless of the collation.

      {"a":"1","b":"2","c":"3"}

       

      Example 2.

      SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');

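      Under the UTF8_LCASE collation, this query currently gives an incorrect result, because the delimiters 'X' and 'Y' are not matched case-insensitively against 'x' and 'y' in the input. The correct result should be: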

      {"a":"1","b":"2","c":"3"}
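
      The two examples can be modelled outside Spark with a minimal Python sketch. This is not the Spark implementation: the function name, the `collation` parameter, and the use of Python's `re.IGNORECASE` as a stand-in for UTF8_LCASE matching are assumptions made for illustration, and the delimiters are treated as literal strings.

```python
import re


def str_to_map(text, pair_delim=",", kv_delim=":", collation="UTF8_BINARY"):
    """Toy model of str_to_map (hypothetical helper, not Spark code)."""
    # Assumption: UTF8_LCASE is approximated by case-insensitive matching,
    # while UTF8_BINARY matches the delimiters exactly as written.
    flags = re.IGNORECASE if collation == "UTF8_LCASE" else 0
    result = {}
    # Split the input into pairs; delimiters are escaped so they match literally.
    for pair in re.split(re.escape(pair_delim), text, flags=flags):
        # Split each pair at the first key/value delimiter only.
        parts = re.split(re.escape(kv_delim), pair, maxsplit=1, flags=flags)
        # A pair without a key/value delimiter maps its whole text to None.
        result[parts[0]] = parts[1] if len(parts) > 1 else None
    return result


# Example 1: the delimiters have no case, so both collations agree.
print(str_to_map("a:1,b:2,c:3", ",", ":"))
# Example 2: only the case-insensitive collation finds 'X' and 'Y'.
print(str_to_map("ay1xby2xcy3", "X", "Y", collation="UTF8_LCASE"))
```

      In this model, Example 2 under UTF8_BINARY finds no delimiter at all and keeps the whole input as a single key, which illustrates why a pass-through implementation is not enough for UTF8_LCASE.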

       

      Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions and learn more about how they work. In addition, look into the possible use cases and implementations of similar functions in other open-source DBMSs, such as PostgreSQL.

       

      The goal for this Jira ticket is to implement the StringToMap expression so that it supports UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced in order to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks in this parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).

       

      Read more about ICU Collation Concepts and the Collator class. Also, refer to the Unicode Technical Standard for string collation.

            People

              Assignee: uros-db Uroš Bojanić
              Reporter: uros-db Uroš Bojanić