Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-42931

dropDuplicates within watermark

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersStop watchingWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete CommentsDelete
    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.5.0
    • 3.5.0
    • Structured Streaming
    • None

    Description

      We got many reports that dropDuplicates does not clean up the state even though they have set a watermark for the query. We document the behavior clearly that the event time column should be a part of the subset columns for deduplication to clean up the state, but it cannot be applied to the customers as timestamps are not exactly the same for duplicated events in their use cases.

      We propose to deduce a new API of dropDuplicates which has following different characteristics compared to existing dropDuplicates:

      • Weaker constraints on the subset (key)
        • Does not require an event time column on the subset.
      • Looser semantics on deduplication
        • Only guarantee to deduplicate events within the watermark.

      Since the new API leverages event time, the new API has following new requirements:

      • The input must be streaming DataFrame.
      • The watermark must be defined.
      • The event time column must be defined in the input DataFrame.

      More specifically on the semantic, once the operator processes the first arrived event, events arriving within the watermark for the first event will be deduplicated.
      (Technically, the expiration time should be the “event time of the first arrived event + watermark delay threshold”, to match up with future events.)

      Users are encouraged to set the delay threshold of watermark longer than max timestamp differences among duplicated events. (If they are unsure, they can alternatively set the delay threshold large enough, e.g. 48 hours.)

      Longer design doc will be attached.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            kabhwan Jungtaek Lim Assign to me
            kabhwan Jungtaek Lim
            Votes:
            0 Vote for this issue
            Watchers:
            2 Stop watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment