[SPARK-46830] Introducing collation concept into Spark - ASF JIRA

This feature will introduce collation support to the Spark engine. This means that:

Every StringType will have an associated collation. The default remains UTF8 Binary, which behaves under the same rules as the current UTF8 String comparison.
Collation will be respected in all collation-sensitive operations: comparisons, hashing, and string operations (contains, startsWith, endsWith, etc.).
Collation can be set in the following ways:
1. With a COLLATE expression, e.g. strExpr COLLATE collation_name.
2. In a CREATE TABLE column definition.
3. By setting the session collation.
All Spark operators need to respect collation settings (filters, joins, shuffles, aggregations, etc.).
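
To make the requirement on comparisons, hashing, string operations and operators more concrete, here is a minimal plain-Python sketch. It is not Spark code and does not reflect the actual design. It assumes a collation can be modelled as a function that maps a string to a comparison key, and that equality, hashing, substring search and grouping all go through that same key. The collation names and the helpers `collation_key`, `equals`, `collated_hash`, `contains` and `count_by` are hypothetical and exist only for illustration. Case folding stands in for a real case-insensitive collation and is only an approximation.

```python
from collections import defaultdict

# Hypothetical collation names for this sketch. The description above only
# names the default, "UTF8 Binary".
BINARY = "UTF8_BINARY_SKETCH"
CASE_INSENSITIVE = "CASE_INSENSITIVE_SKETCH"


def collation_key(s: str, collation: str) -> bytes:
    """Map a string to the byte key that decides equality and hashing."""
    if collation == BINARY:
        # Binary collation: compare the raw UTF-8 bytes.
        return s.encode("utf-8")
    if collation == CASE_INSENSITIVE:
        # Assumption: case folding approximates a case-insensitive collation.
        return s.casefold().encode("utf-8")
    raise ValueError(f"unknown collation: {collation}")


def equals(a: str, b: str, collation: str) -> bool:
    """Comparison: two strings are equal when their collation keys match."""
    return collation_key(a, collation) == collation_key(b, collation)


def collated_hash(s: str, collation: str) -> int:
    """Hashing: strings that are equal under a collation must hash alike,
    otherwise hash joins and shuffles would separate them."""
    return hash(collation_key(s, collation))


def contains(haystack: str, needle: str, collation: str) -> bool:
    """String operation: substring search performed on collation keys."""
    return collation_key(needle, collation) in collation_key(haystack, collation)


def count_by(values, collation: str) -> dict:
    """Aggregation: group values by collation key and count each group."""
    counts = defaultdict(int)
    for v in values:
        counts[collation_key(v, collation)] += 1
    return dict(counts)
```

Under the binary collation in this sketch, "Spark" and "SPARK" are different values and fall into separate groups. Under the case-insensitive one they compare equal, hash to the same value, and are counted together. An engine that honoured collation in its filters but not in its shuffle hashing would send the two spellings to different partitions and return wrong join or aggregation results.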

This is a high-level description of the feature. You can find the detailed design under this link (the doc is attached as well).