[SPARK-47410] refactor UTF8String and CollationFactory - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: SQL
Labels:
- pull-request-available

Description

This ticket addresses the need to refactor the UTF8String and CollationFactory classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions.

The changed introduced herein should simplify addition of new collation-aware operations and ensure consistent testing across the codebase.

To further support the addition of collation support for new Spark expressions, here are a couple of guidelines to follow:

// 1. Collation-aware expression implementation

CollationSupport.java

should serve as a static entry point for collation-aware expressions, providing custom support
for example: one by one Spark expression with corresponding collation support
also note that: CollationAwareUTF8String should be used for collation-aware UTF8String operations & other utility methods

CollationFactory.java

should continue to serve as a static provider for high-level collation interface
for example: interacting with external ICU components such as Collator, StringSearch, etc.
also note that: no low-level / expression-specific code should generally be found here

UTF8String.java

should be largely collation-unaware, and generally be used only as storage, nothing else
for example: don’t change this class at all (with the only one-time exception of: semanticEquals/Compare)
also note that: no collation-aware operation implementations (using collationId) should be put here

stringExpressions.scala / regexpExpressions.scala / other “sql.catalyst.expressions” (for example: Between.scala)

should only contain minimal changes in order to re-route collation-aware implementations to CollationSupport
for example: most changes should be in relation to: adding collationId, using correct data types, replacements, etc.
also note that: nullSafeEval & doGenCode should likely note contain introduce extra branching based on collationId

// 2. Collation-aware expression testing

CollationSuite.scala

should be used for testing more general collation concepts
for example: collate/collation expressions, collation names, DDL, casting, aggregate, shuffle, join, etc.
also note that: no extra tests should generally be added for newly supported string expressions

CollationSupportSuite.java

should be used for expression unit tests, these tests should be as rigorous as possible in order to cover various cases
for example: unit tests that test collation-aware expression implementation for various collations (binary, lowercase, ICU)
also note that: these tests should generally be written after adding appropriate expression support in CollationSupport.java

CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / CollationExpressionSuite.scala

should be used for expression end-to-end tests, these tests should only cover crucial expression behaviour
for example: SQL tests that verify query execution results, expected return data types, casting, unsupported collation handling, etc.
also note that: these tests should generally be written after enabling appropriate expression support in stringExpressions.scala

// 3. Closing notes

Carefully think about performance implications of newly added custom collation-aware expression implementation
for example: be very careful with extra string allocations (UTF8Strings -> (Java) String -> UTF8Strings, etc.)
also note that: some operations introduce very heavy performance penalties (we should avoid the ones we can)

Make sure to test all newly added expressions and completely (unit tests, end-to-end tests, etc.)
for example: consider edge, such as: empty strings, uppercase and lowercase mix, different byte-length chars, etc.
also note that: all similar tests should be uniform & readable and be kept in one place for various expressions

Consider how new expressions interact with the rest of the system (casting; collation support level - use correct AbstractStringType, etc.)
for example: we should watch out for casting, test it thoroughly, and use CollationTypeCasts if needed for new expressions
also note that: some expressions won’t support all collation types, so that should be clearly reflected in tests and dataTypes

Attachments

Issue Links

links to

GitHub Pull Request #45978

GitHub Pull Request #47622

GitHub Pull Request #47671

Activity

People

Assignee:: Uroš Bojanić

Reporter:: Uroš Bojanić

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 15/Mar/24 11:55

Updated:: 09/Aug/24 01:22

Resolved:: 11/Apr/24 14:21