[FLINK-4545] Flink automatically manages TM network buffer - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.3.0
Component/s: Runtime / Network
Labels: None

Description

Currently, the number of network buffers per task manager is preconfigured through the taskmanager.network.numberOfBuffers setting, and the memory for them is pre-allocated. In a job DAG with a shuffle phase, the required number can grow very large, depending on the size of the TM cluster. The formula for calculating the buffer count is documented here (https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#configuring-the-network-buffers):

#slots-per-TM^2 * #TMs * 4
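
To give a feel for how quickly this number grows, here is a minimal sketch that evaluates the formula above for a few cluster sizes. The function names are made up for illustration, and the 32 KiB buffer size is an assumption (it matches the default segment size as far as I know, but check it against your own configuration).

```python
# Rough sizing sketch for the rule of thumb: #slots-per-TM^2 * #TMs * 4.
# All names here are hypothetical helpers, not part of any Flink API.

ASSUMED_BUFFER_SIZE_BYTES = 32 * 1024  # assumption: 32 KiB per network buffer


def recommended_network_buffers(slots_per_tm: int, num_tms: int, factor: int = 4) -> int:
    """Return the buffer count suggested by the documented formula."""
    if slots_per_tm < 1 or num_tms < 1:
        raise ValueError("slots_per_tm and num_tms must be positive")
    return slots_per_tm ** 2 * num_tms * factor


def buffer_memory_bytes(slots_per_tm: int, num_tms: int,
                        buffer_size: int = ASSUMED_BUFFER_SIZE_BYTES) -> int:
    """Return the memory that would be pre-allocated for that many buffers."""
    return recommended_network_buffers(slots_per_tm, num_tms) * buffer_size


if __name__ == "__main__":
    # Same slots per TM, growing cluster: the static setting must be raised each time.
    for num_tms in (5, 20, 80):
        count = recommended_network_buffers(8, num_tms)
        mib = buffer_memory_bytes(8, num_tms) / (1024 * 1024)
        print(f"{num_tms} TMs x 8 slots: {count} buffers, about {mib:.0f} MiB per TM")
```

Because the count depends on #TMs, every change in cluster size changes the value that each task manager would need, which is why a static setting is awkward for dynamic scaling.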

In a standalone deployment, we may need to control the task manager cluster size dynamically and then leverage the upcoming Flink feature that supports rescaling job parallelism at runtime.
If the buffer count is static at runtime and cannot be changed without restarting the task manager process, this may add latency and complexity to the scaling process. I am wondering whether there has already been any discussion about having Flink manage the network buffers automatically, or at least exposing an API that allows them to be reconfigured. Let me know if there is an existing JIRA that I should follow.

Issue Links

links to

GitHub Pull Request #3467

GitHub Pull Request #3480

GitHub Pull Request #3721

People

Assignee: Nico Kruber

Reporter: Zhenzhong Xu

Votes: 1

Watchers: 14

Dates

Created: 31/Aug/16 18:45

Updated: 13/Apr/21 20:31

Resolved: 06/May/17 17:51