FLINK-34552

Support message deduplication for input data sources

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      I propose adding duplicate-message suppression to the Flink Table API so that duplicates from input sources can be dropped. It could be an option the user sets to choose whether duplicates from an input source are suppressed. My use case and the approaches I have tried are described below.

       

      I have a Flink job that reads from two keyed Kafka topics and writes to a keyed Kafka topic. The job executes the following join query:

      SELECT a.id, adata, bdata
      FROM a
      JOIN b
      ON a.id = b.id

       

      In addition to meaningful data, one of the input Kafka topics produces messages with a duplicate payload for the same PK. These cause duplicates in the output topic and put extra load on downstream services.

       

      I looked for a way to suppress duplicates and found two strategies, neither of which seems to work for my use case:

      1. A deduplication window, for example the Kafka sink buffer (https://github.com/apache/flink-connector-kafka/blob/main/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/table/ReducingUpsertSink.java#L46). A deduplication window doesn't work well for my case because the interval between duplicates is one day, and I don't want my data to be delayed by such a large window.

       

      2. Using ROW_NUMBER. Unfortunately, this approach doesn't suit my use case either. Kafka topics a and b are CDC data streams and contain DELETE and REFRESH messages. If a DELETE and a REFRESH message arrive with the same payload, the job suppresses the later one, which leads to incorrect output. If I add message_type to the PARTITION BY key, the job cannot handle message sequences such as DELETE->REFRESH->DELETE (with the same payload and PK): the last message is suppressed, which again leads to incorrect output.
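      The failure mode of the second strategy can be illustrated with a small simulation. This is a plain-Python sketch of keep-first-row semantics, not Flink code; the function name `keep_first_row`, the field names `type` and `payload`, and the sample stream are all assumptions made for illustration.

```python
def keep_first_row(messages, partition_key):
    """Emit only the first message seen for each partition key.

    This mimics keep-first-row deduplication in plain Python;
    it is an illustrative simulation, not the Flink implementation.
    """
    seen = set()
    emitted = []
    for msg in messages:
        key = partition_key(msg)
        if key not in seen:
            seen.add(key)
            emitted.append(msg)
    return emitted


# Hypothetical CDC stream: same PK and payload, alternating message types.
stream = [
    {"id": 1, "type": "DELETE", "payload": "x"},
    {"id": 1, "type": "REFRESH", "payload": "x"},
    {"id": 1, "type": "DELETE", "payload": "x"},
]

# Partition by (PK, payload): the REFRESH after the DELETE is dropped.
by_payload = keep_first_row(stream, lambda m: (m["id"], m["payload"]))

# Partition by (PK, payload, message type): the final DELETE is dropped.
by_payload_and_type = keep_first_row(
    stream, lambda m: (m["id"], m["payload"], m["type"])
)

print([m["type"] for m in by_payload])           # ['DELETE']
print([m["type"] for m in by_payload_and_type])  # ['DELETE', 'REFRESH']
```

      In both variants the emitted sequence no longer ends with the same message type as the input, so a downstream consumer would see a stale state for that PK.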

       

      In the end, I had to create a separate custom Flink service that reads the output topic of the initial job and suppresses duplicates by keeping a hash of the last processed message for each PK in Flink state.
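      The last-hash approach can be sketched as follows. This is a minimal plain-Python illustration, not the actual service: a dictionary stands in for Flink keyed state, and the class name `LastHashDeduplicator`, the message fields, and the choice to hash the whole message (including its type) with SHA-256 are assumptions.

```python
import hashlib
import json


class LastHashDeduplicator:
    """Suppress a message only if it equals the previous message for the same PK."""

    def __init__(self):
        # Stands in for Flink keyed state: PK -> hash of the last emitted message.
        self._last_hash = {}

    def process(self, pk, message):
        """Return True if the message should be emitted, False if suppressed."""
        # Assumption: the hash covers the full message, including its type.
        encoded = json.dumps(message, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(encoded).hexdigest()
        if self._last_hash.get(pk) == digest:
            return False
        self._last_hash[pk] = digest
        return True


dedup = LastHashDeduplicator()

# Hypothetical stream: one consecutive duplicate, then alternating types.
stream = [
    (1, {"type": "REFRESH", "payload": "x"}),
    (1, {"type": "REFRESH", "payload": "x"}),  # consecutive duplicate
    (1, {"type": "DELETE", "payload": "x"}),
    (1, {"type": "REFRESH", "payload": "x"}),
    (1, {"type": "DELETE", "payload": "x"}),
]

emitted = [msg["type"] for pk, msg in stream if dedup.process(pk, msg)]
print(emitted)  # ['REFRESH', 'DELETE', 'REFRESH', 'DELETE']
```

      Because each message is compared only with the previous one for the same PK, the DELETE->REFRESH->DELETE sequence passes through intact while consecutive duplicates are dropped, and the state holds one hash per PK regardless of how far apart the duplicates arrive.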

People

    • Assignee: Unassigned
    • Reporter: Sergey Anokhovskiy (anohovsky)