Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-32554

Facilitate slot isolation and resource management for global committer



    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.16.2
    • None
    • None


      Flink's global committer executes unique workload compared to the source and sink operators. In some use cases, it may require much higher amount of resources (CPU, memory) than other operators. However, according to this source code, currently it is not possible to isolate the global committer to a dedicated task manager or task slot, or assign more resources to it by leveraging the fine grained resource management. Flink would always make the global committer task share with another task in a task slot. (In one test, we tried to have one more task slot than required by the source/sink parallelism, but Flink still assigns the global committer to share a slot with another task.)

      As a result, we often see CPU utilization spike on the task manger that runs the global committer compared with other task managers and becomes the bottleneck for the job. Due to slot sharing and inadequate resources on the global committer, the job takes long time to initialize upon restarting and the checkpoints take long time to complete. Our job consumes from Kafka and this bottleneck causes significant increase of consumer lag. The lag in turn causes the Kafka source operator to replay backlogs, causing more CPU consumption on the source operator and making it worse for the global committer that runs in the same task slot.

      At minimum, we want the capability to configure the global committer to run in its own task slot, and make that work under reactive scaling. It would also be great to make the fine grained resource management working for global committer.







            Unassigned Unassigned
            allenxwang Allen Wang
            0 Vote for this issue
            2 Start watching this issue