Details
- Bug
- Status: Resolved
- Blocker
- Resolution: Fixed
- 4.0.0
Description
Arguably this is a correctness issue, though we haven't released the collation feature yet.
Collation introduces the concept of binary (in)equality, which means that in some collations we are no longer able to just compare the binary format of two UnsafeRows to determine equality.
For example, 'aaa' and 'AAA' can be "semantically" the same in a case-insensitive collation.
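A minimal sketch of the distinction, in plain Python rather than Spark. Here `str.casefold()` is only a stand-in for a case-insensitive collation, and both function names are hypothetical; this is not how Spark compares UnsafeRows internally.

```python
def binary_equal(a: str, b: str) -> bool:
    """Compare the encoded bytes, as a byte-wise row comparison would."""
    return a.encode("utf-8") == b.encode("utf-8")


def collation_equal(a: str, b: str) -> bool:
    """Compare under an assumed case-insensitive collation (casefold stand-in)."""
    return a.casefold() == b.casefold()


# The two notions of equality disagree for strings that differ only in case.
binary_result = binary_equal("aaa", "AAA")
semantic_result = collation_equal("aaa", "AAA")
```

With these definitions the byte comparison treats 'aaa' and 'AAA' as different values, while the collation-aware comparison treats them as the same.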
State store is basically key-value storage, and most provider implementations rely on the fact that all the columns in the key schema support binary equality. We need to disallow using binary inequality columns in the key schema until we can support them in the majority of state store providers (or at a higher level of the state store).
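The kind of check proposed above could look roughly like the following. This is a sketch in plain Python, not the actual Spark change: `KeyColumn`, its `supports_binary_equality` flag, and `validate_key_schema` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class KeyColumn:
    """Hypothetical description of one column in a state store key schema."""
    name: str
    supports_binary_equality: bool


def validate_key_schema(columns: Iterable[KeyColumn]) -> None:
    """Reject key schemas containing columns that cannot be compared byte-wise."""
    offending = [c.name for c in columns if not c.supports_binary_equality]
    if offending:
        raise ValueError(
            "Binary inequality column(s) not allowed in state store key schema: "
            + ", ".join(offending)
        )
```

Failing fast when the key schema is set up means the query is rejected before any state is written with keys that would later compare incorrectly.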
Why is this a correctness issue? For example, streaming aggregation will produce an aggregation output that does not respect semantic equality.
e.g. df.groupBy(strCol).count()
Although strCol uses a case-insensitive collation, 'a' and 'A' won't be counted together in streaming aggregation, while they should be.
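The miscount can be illustrated with a toy key-value store in plain Python. This is only a simulation under stated assumptions: a `Counter` stands in for the state store, encoded bytes stand in for the binary key, and `str.casefold()` stands in for a case-insensitive collation. Both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_binary_key(values: Iterable[str]) -> Dict[bytes, int]:
    """Group by the raw encoded bytes, as a byte-keyed state store would."""
    state: Counter = Counter()
    for v in values:
        state[v.encode("utf-8")] += 1
    return dict(state)


def count_by_collation_key(values: Iterable[str]) -> Dict[str, int]:
    """Group by a collation-normalized key (casefold as a stand-in)."""
    state: Counter = Counter()
    for v in values:
        state[v.casefold()] += 1
    return dict(state)


rows = ["a", "A", "a"]
binary_counts = count_by_binary_key(rows)        # two separate groups
semantic_counts = count_by_collation_key(rows)   # one merged group
```

Grouping on the binary key splits the rows into two groups, while grouping on the normalized key yields the single group that a case-insensitive column should produce.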