Spark / SPARK-47776

State store operation cannot work properly with binary inequality collation

Details

    Description

      Arguably this is a correctness issue, though the collation feature has not been released yet.

      Collation introduces the concept of binary (in)equality: under some collations we can no longer determine equality just by comparing the binary format of two UnsafeRows.

      For example, 'aaa' and 'AAA' can be "semantically" the same under a case-insensitive collation.

      The state store is basically key-value storage, and most provider implementations rely on the fact that all columns in the key schema support binary equality. We need to disallow binary inequality columns in the key schema until the majority of state store providers (or a higher level of the state store) can support them.
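
      A minimal sketch of what such a check could look like, written in plain Python rather than against Spark's actual classes. The `Column` type, its `supports_binary_equality` flag, and `validate_key_schema` are all hypothetical names introduced for illustration; they are not Spark APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    # Hypothetical column descriptor, not a Spark type.
    name: str
    # Hypothetical flag: True when equal values always have equal bytes.
    supports_binary_equality: bool


def validate_key_schema(key_schema):
    """Reject key schemas containing columns that lack binary equality."""
    offending = [c.name for c in key_schema if not c.supports_binary_equality]
    if offending:
        raise ValueError(
            f"Binary inequality columns are not allowed in the key schema: {offending}"
        )


# A key schema with only binary-equality columns passes the check.
key_schema_ok = [Column("id", True)]
validate_key_schema(key_schema_ok)

# A case-insensitive string column in the key schema is rejected.
key_schema_bad = [Column("id", True), Column("strCol", False)]
try:
    validate_key_schema(key_schema_bad)
    rejected = False
except ValueError:
    rejected = True
```

      Failing fast when the key schema is built keeps the restriction in one place, instead of requiring every provider to handle semantic equality on its own.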

      Why is this a correctness issue? For example, streaming aggregation will produce aggregation output that does not respect semantic equality.

      e.g. df.groupBy(strCol).count() 

      Even though strCol is case insensitive, 'a' and 'A' won't be counted together in streaming aggregation, although they should be.
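
      The effect can be illustrated with a toy model in plain Python. This is not Spark code: a `Counter` keyed by raw UTF-8 bytes stands in for a state store that compares keys by their binary format, and `str.lower()` is an assumed stand-in for a case-insensitive collation (real collations are more involved).

```python
from collections import Counter


def count_by_binary_key(values):
    """Toy state store: the key is the raw UTF-8 bytes of the grouping value."""
    state = Counter()
    for v in values:
        state[v.encode("utf-8")] += 1
    return dict(state)


def count_by_collation_key(values):
    """Group by a normalized key, as a case-insensitive collation would require.

    Lowercasing is only an assumed approximation of such a collation.
    """
    state = Counter()
    for v in values:
        state[v.lower().encode("utf-8")] += 1
    return dict(state)


rows = ["a", "A", "a", "b"]
binary_counts = count_by_binary_key(rows)       # 'a' and 'A' land in separate groups
semantic_counts = count_by_collation_key(rows)  # 'a' and 'A' share one group
print(binary_counts)
print(semantic_counts)
```

      With binary keys, 'a' and 'A' end up in different groups, which mirrors the incorrect streaming aggregation result described above.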

            People

              kabhwan Jungtaek Lim
