Flink / FLINK-10348

Solve data skew when consuming data from Kafka


    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Connectors / Kafka
    • Labels:
      None

      Description

      When using KafkaConsumer, the strategy is to send fetch requests to brokers with a fixed fetch size. Assume topic x has n partitions and there is data skew between them. If we consume topic x from the earliest offset, each fetch request can return up to the max fetch size of data per partition. The problem is that when a task consumes from both "big" partitions and "small" partitions, records from the "big" partitions may become late elements, because the "small" partitions are consumed faster and advance the watermark ahead of them.

      Solution:
      I think we can leverage two parameters to control this:
      1. data.skew.check // whether to check for data skew
      2. data.skew.check.interval // the interval between checks
      Every data.skew.check.interval, we check the latest offset of every partition, compute its lag as (latest offset - current offset), identify the partitions that need to slow down, and redefine their fetch size accordingly.
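      The skew check described above can be sketched as a small helper. This is a hypothetical illustration, not Flink or Kafka API: the class name SkewCheck, the method partitionsToSlowDown, and the threshold parameter are all assumptions. It computes each partition's lag from the latest and current offsets and flags partitions whose lag is far below the maximum lag as candidates to slow down (i.e. to fetch less per request), which is the "big" vs. "small" partition comparison from the proposal.

      ```java
      import java.util.HashMap;
      import java.util.Map;
      import java.util.Set;
      import java.util.stream.Collectors;

      /** Hypothetical sketch of the proposed data.skew.check step. */
      public class SkewCheck {
          /**
           * Returns the partitions whose lag is below threshold * maxLag,
           * i.e. partitions that are being consumed faster than the most
           * lagging ("big") partition and should be slowed down.
           *
           * @param latestOffsets  latest offset per partition id
           * @param currentOffsets current consumer offset per partition id
           * @param threshold      fraction of the maximum lag (e.g. 0.5)
           */
          static Set<Integer> partitionsToSlowDown(Map<Integer, Long> latestOffsets,
                                                   Map<Integer, Long> currentOffsets,
                                                   double threshold) {
              // lag = latest offset - current offset, per partition
              Map<Integer, Long> lag = new HashMap<>();
              for (Map.Entry<Integer, Long> e : latestOffsets.entrySet()) {
                  long current = currentOffsets.getOrDefault(e.getKey(), 0L);
                  lag.put(e.getKey(), e.getValue() - current);
              }
              long maxLag = lag.values().stream().max(Long::compare).orElse(0L);
              // partitions far ahead of the slowest one get throttled
              return lag.entrySet().stream()
                        .filter(e -> e.getValue() < threshold * maxLag)
                        .map(Map.Entry::getKey)
                        .collect(Collectors.toSet());
          }
      }
      ```

      For example, if partition 0 lags by 1,000,000 records while partition 1 lags by only 100, partition 1 is consumed much faster and would be flagged for a smaller fetch size. In a real implementation the latest offsets could come from KafkaConsumer#endOffsets and the current offsets from KafkaConsumer#position; note that the vanilla Kafka client does not expose a per-partition dynamic fetch size, which is part of why this would need connector-level support.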


            People

            • Assignee: Jiayi Liao (wind_ljy)
            • Reporter: Jiayi Liao (wind_ljy)
            • Votes: 0
            • Watchers: 6
