Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
4.0.0
Description
This ticket addresses the need to refactor the UTF8String and CollationFactory classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions.
The changed introduced herein should simplify addition of new collation-aware operations and ensure consistent testing across the codebase.
To further support the addition of collation support for new Spark expressions, here are a couple of guidelines to follow:
// 1. Collation-aware expression implementation
CollationSupport.java
- should serve as a static entry point for collation-aware expressions, providing custom support
- for example: one by one Spark expression with corresponding collation support
- also note that: CollationAwareUTF8String should be used for collation-aware UTF8String operations & other utility methods
CollationFactory.java
- should continue to serve as a static provider for high-level collation interface
- for example: interacting with external ICU components such as Collator, StringSearch, etc.
- also note that: no low-level / expression-specific code should generally be found here
UTF8String.java
- should be largely collation-unaware, and generally be used only as storage, nothing else
- for example: don’t change this class at all (with the only one-time exception of: semanticEquals/Compare)
- also note that: no collation-aware operation implementations (using collationId) should be put here
stringExpressions.scala / regexpExpressions.scala / other “sql.catalyst.expressions” (for example: Between.scala)
- should only contain minimal changes in order to re-route collation-aware implementations to CollationSupport
- for example: most changes should be in relation to: adding collationId, using correct data types, replacements, etc.
- also note that: nullSafeEval & doGenCode should likely note contain introduce extra branching based on collationId
// 2. Collation-aware expression testing
CollationSuite.scala
- should be used for testing more general collation concepts
- for example: collate/collation expressions, collation names, DDL, casting, aggregate, shuffle, join, etc.
- also note that: no extra tests should generally be added for newly supported string expressions
CollationSupportSuite.java
- should be used for expression unit tests, these tests should be as rigorous as possible in order to cover various cases
- for example: unit tests that test collation-aware expression implementation for various collations (binary, lowercase, ICU)
- also note that: these tests should generally be written after adding appropriate expression support in CollationSupport.java
CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / CollationExpressionSuite.scala
- should be used for expression end-to-end tests, these tests should only cover crucial expression behaviour
- for example: SQL tests that verify query execution results, expected return data types, casting, unsupported collation handling, etc.
- also note that: these tests should generally be written after enabling appropriate expression support in stringExpressions.scala
// 3. Closing notes
- Carefully think about performance implications of newly added custom collation-aware expression implementation
- for example: be very careful with extra string allocations (UTF8Strings -> (Java) String -> UTF8Strings, etc.)
- also note that: some operations introduce very heavy performance penalties (we should avoid the ones we can)
- Make sure to test all newly added expressions and completely (unit tests, end-to-end tests, etc.)
- for example: consider edge, such as: empty strings, uppercase and lowercase mix, different byte-length chars, etc.
- also note that: all similar tests should be uniform & readable and be kept in one place for various expressions
- Consider how new expressions interact with the rest of the system (casting; collation support level - use correct AbstractStringType, etc.)
- for example: we should watch out for casting, test it thoroughly, and use CollationTypeCasts if needed for new expressions
- also note that: some expressions won’t support all collation types, so that should be clearly reflected in tests and dataTypes