Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-46837 String function support (parent)
  3. SPARK-47410

refactor UTF8String and CollationFactory



    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 4.0.0
    • 4.0.0
    • SQL


      This ticket addresses the need to refactor the UTF8String and CollationFactory classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions.

      The changed introduced herein should simplify addition of new collation-aware operations and ensure consistent testing across the codebase.


      To further support the addition of collation support for new Spark expressions, here are a couple of guidelines to follow:


      // 1. Collation-aware expression implementation


      • should serve as a static entry point for collation-aware expressions, providing custom support
      • for example: one by one Spark expression with corresponding collation support
      • also note that: CollationAwareUTF8String should be used for collation-aware UTF8String operations & other utility methods


      • should continue to serve as a static provider for high-level collation interface
      • for example: interacting with external ICU components such as Collator, StringSearch, etc.
      • also note that: no low-level / expression-specific code should generally be found here


      • should be largely collation-unaware, and generally be used only as storage, nothing else
      • for example: don’t change this class at all (with the only one-time exception of: semanticEquals/Compare)
      • also note that: no collation-aware operation implementations (using collationId) should be put here

      stringExpressions.scala / regexpExpressions.scala / other “sql.catalyst.expressions” (for example: Between.scala)

      • should only contain minimal changes in order to re-route collation-aware implementations to CollationSupport
      • for example: most changes should be in relation to: adding collationId, using correct data types, replacements, etc.
      • also note that: nullSafeEval & doGenCode should likely note contain introduce extra branching based on collationId


      // 2. Collation-aware expression testing


      • should be used for testing more general collation concepts
      • for example: collate/collation expressions, collation names, DDL, casting, aggregate, shuffle, join, etc.
      • also note that: no extra tests should generally be added for newly supported string expressions


      • should be used for expression unit tests, these tests should be as rigorous as possible in order to cover various cases
      • for example: unit tests that test collation-aware expression implementation for various collations (binary, lowercase, ICU)
      • also note that: these tests should generally be written after adding appropriate expression support in CollationSupport.java

      CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / CollationExpressionSuite.scala

      • should be used for expression end-to-end tests, these tests should only cover crucial expression behaviour
      • for example: SQL tests that verify query execution results, expected return data types, casting, unsupported collation handling, etc.
      • also note that: these tests should generally be written after enabling appropriate expression support in stringExpressions.scala


      // 3. Closing notes

      • Carefully think about performance implications of newly added custom collation-aware expression implementation
      • for example: be very careful with extra string allocations (UTF8Strings -> (Java) String -> UTF8Strings, etc.)
      • also note that: some operations introduce very heavy performance penalties (we should avoid the ones we can)


      • Make sure to test all newly added expressions and completely (unit tests, end-to-end tests, etc.)
      • for example: consider edge, such as: empty strings, uppercase and lowercase mix, different byte-length chars, etc.
      • also note that: all similar tests should be uniform & readable and be kept in one place for various expressions


      • Consider how new expressions interact with the rest of the system (casting; collation support level - use correct AbstractStringType, etc.)
      • for example: we should watch out for casting, test it thoroughly, and use CollationTypeCasts if needed for new expressions
      • also note that: some expressions won’t support all collation types, so that should be clearly reflected in tests and dataTypes


        Issue Links



              uros-db Uroš Bojanić
              uros-db Uroš Bojanić
              0 Vote for this issue
              3 Start watching this issue