[SPARK-47776] State store operation cannot work properly with binary inequality collation - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: Structured Streaming
Labels:
- pull-request-available

Description

Arguably this is a correctness issue, though the collation feature has not been released yet.

Collation introduces the concept of binary (in)equality: under some collations, we can no longer determine equality just by comparing the binary format of two UnsafeRows.

For example, 'aaa' and 'AAA' can be "semantically" the same under a case-insensitive collation.
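The difference between the two notions of equality can be shown with a small Python sketch. It is only an illustration: `binary_equal` and `case_insensitive_equal` are hypothetical helper names, UTF-8 bytes stand in for the binary row format, and `str.lower()` stands in for a case-insensitive collation. None of this is Spark code.

```python
def binary_equal(left: str, right: str) -> bool:
    """Compare the encoded bytes, as a byte-wise row comparison would."""
    return left.encode("utf-8") == right.encode("utf-8")


def case_insensitive_equal(left: str, right: str) -> bool:
    """Assumed stand-in for a case-insensitive collation: compare lowercased values."""
    return left.lower() == right.lower()


print(binary_equal("aaa", "AAA"))            # False: the bytes differ
print(case_insensitive_equal("aaa", "AAA"))  # True: semantically the same
```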

The state store is basically key-value storage, and most provider implementations rely on all columns in the key schema supporting binary equality. We need to disallow binary inequality columns in the key schema until we can support them in the majority of state store providers (or at a higher level of the state store).
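The kind of guard described here could look roughly like the sketch below. It is a hypothetical model, not the actual change in the linked pull request: `KeyColumn`, `supports_binary_equality`, and `validate_key_schema` are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class KeyColumn:
    """Hypothetical description of one column in a state store key schema."""
    name: str
    supports_binary_equality: bool


def validate_key_schema(columns: Iterable[KeyColumn]) -> None:
    """Reject any key schema containing a column that lacks binary equality."""
    offending = [c.name for c in columns if not c.supports_binary_equality]
    if offending:
        raise ValueError(
            "Binary inequality columns are not allowed in the state store "
            f"key schema: {', '.join(offending)}"
        )


# A key schema with only binary-equality columns passes silently.
validate_key_schema([KeyColumn("id", True)])

# A case-insensitive string column in the key is rejected up front.
try:
    validate_key_schema([KeyColumn("id", True), KeyColumn("strCol", False)])
except ValueError as err:
    print(err)
```

Failing fast when the query is set up is preferable to silently producing wrong aggregates at runtime.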

Why is this a correctness issue? For example, streaming aggregation will produce aggregation output that ignores semantic equality.

e.g. df.groupBy(strCol).count()

Although strCol is case-insensitive, 'a' and 'A' won't be counted together in streaming aggregation, though they should be.
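The miscount can be reproduced with a toy model of a byte-keyed store. This is a simplified sketch under stated assumptions, not Spark's state store: a `Counter` plays the role of the store, UTF-8 bytes play the role of the binary key, and `str.lower()` is an assumed stand-in for the case-insensitive collation. The function names are hypothetical.

```python
from collections import Counter


def count_by_binary_key(values):
    """Toy state store: each key is the raw UTF-8 bytes of the value."""
    state = Counter()
    for value in values:
        state[value.encode("utf-8")] += 1
    return dict(state)


def count_by_collation_key(values):
    """Expected behavior: keys are normalized per the (assumed) collation."""
    state = Counter()
    for value in values:
        state[value.lower()] += 1
    return dict(state)


events = ["a", "A", "a"]
print(count_by_binary_key(events))     # {b'a': 2, b'A': 1}
print(count_by_collation_key(events))  # {'a': 3}
```

With byte-wise keys the three events split into two groups, whereas a case-insensitive grouping should yield a single group with a count of 3.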

Issue Links

links to

GitHub Pull Request #45951

People

Assignee: Jungtaek Lim

Reporter: Jungtaek Lim

Votes: 0

Watchers: 1

Dates

Created: 09/Apr/24 06:57

Updated: 10/Apr/24 04:38

Resolved: 10/Apr/24 04:38