Spark / SPARK-47776

State store operation cannot work properly with binary inequality collation


Details

    Description

      Arguably this is a correctness issue, though we haven't released the collation feature yet.

      Collation introduces the concept of binary (in)equality, which means that under some collations we can no longer just compare the binary format of two UnsafeRows to determine equality.

      For example, 'aaa' and 'AAA' can be "semantically" the same in a case-insensitive collation.
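The difference between binary and semantic equality can be sketched in plain Python. This is only an illustration, not Spark code: the helper names are hypothetical, and `str.lower()` is an assumed stand-in for a case-insensitive collation.

```python
def binary_equal(a: str, b: str) -> bool:
    """Compare the encoded bytes, analogous to comparing raw row bytes."""
    return a.encode("utf-8") == b.encode("utf-8")


def case_insensitive_equal(a: str, b: str) -> bool:
    """Compare under an assumed case-insensitive collation (lowercasing)."""
    return a.lower() == b.lower()


print(binary_equal("aaa", "AAA"))            # False
print(case_insensitive_equal("aaa", "AAA"))  # True
```

The two strings differ byte for byte, yet a case-insensitive collation treats them as the same value.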

      State store is basically a key-value storage, and most provider implementations rely on the fact that all columns in the key schema support binary equality. We need to disallow binary inequality columns in the key schema until we can support them in the majority of state store providers (or at a higher level of the state store).
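A check of the kind proposed above might look roughly like the following. This is a minimal sketch under stated assumptions, not Spark's implementation: `Column`, `validate_key_schema`, and the set of collations treated as binary-equality-safe are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Column:
    name: str
    collation: str  # illustrative collation name for a string column


# Assumption: only this collation supports plain byte-wise equality.
BINARY_EQUALITY_COLLATIONS = {"UTF8_BINARY"}


def validate_key_schema(columns: List[Column]) -> None:
    """Reject key schemas containing columns that lack binary equality."""
    offending = [
        c.name for c in columns
        if c.collation not in BINARY_EQUALITY_COLLATIONS
    ]
    if offending:
        raise ValueError(
            "Binary inequality columns are not allowed in the state store "
            f"key schema: {offending}"
        )


# A key schema with only binary-equality columns passes silently.
validate_key_schema([Column("id", "UTF8_BINARY")])

# A case-insensitive key column is rejected up front.
try:
    validate_key_schema([Column("strCol", "UTF8_LCASE")])
except ValueError as e:
    print(e)
```

Failing fast at schema validation avoids silently producing wrong results later, which is the concern this issue raises.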

      Why is this a correctness issue? For example, streaming aggregation will produce an aggregation output that does not respect semantic equality.

      e.g. df.groupBy(strCol).count() 

      Although strCol is case-insensitive, 'a' and 'A' won't be counted together in streaming aggregation, while they should be.
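The miscount can be reproduced with a toy model of a state store keyed by raw bytes. This is a simplified sketch, not how Spark's providers are implemented: a `Counter` stands in for the store, and lowercasing is an assumed stand-in for the case-insensitive collation, used here only to show the expected result.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_binary_key(values: Iterable[str]) -> Dict[bytes, int]:
    """Group by the raw encoded bytes, as a binary-keyed store would."""
    state: Counter = Counter()
    for v in values:
        state[v.encode("utf-8")] += 1
    return dict(state)


def count_by_collation_key(values: Iterable[str]) -> Dict[bytes, int]:
    """Group by a collation-aware key (assumed here: the lowercased value)."""
    state: Counter = Counter()
    for v in values:
        state[v.lower().encode("utf-8")] += 1
    return dict(state)


rows = ["a", "A", "a", "b"]
print(count_by_binary_key(rows))     # {b'a': 2, b'A': 1, b'b': 1}
print(count_by_collation_key(rows))  # {b'a': 3, b'b': 1}
```

With binary keys, 'a' and 'A' land in separate groups, so the count for the semantically single group is split across two entries.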

          People

            Jungtaek Lim (kabhwan)