Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.15.2
Environment: Flink 1.15.2, AWS EMR.
Description
We're running a relatively small Flink cluster (7 TaskManagers * 8 cores) and are using Datadog for telemetry.
The numbers for outgoing traffic didn't add up when we compared the Kafka producers, task activity, and host system metrics. After investigation, we discovered that the unexplained traffic was generated by the DatadogHttpReporter.
We switched the reporter to an implementation using the Java DogStatsD client (reporting to a Datadog agent on each host).
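For readers unfamiliar with DogStatsD, here is a minimal sketch of what reporting a gauge to a local agent looks like on the wire. It is written in Python for brevity and is not our reporter: the helper names `format_gauge` and `send_gauge`, the tag `host:tm-1`, and the value `0.42` are made up for illustration, and the agent address (`127.0.0.1:8125`, the usual DogStatsD default) is an assumption about the host setup.

```python
import socket


def format_gauge(name, value, tags=None):
    """Build one DogStatsD gauge datagram: <name>:<value>|g|#<tag>,<tag>."""
    line = f"{name}:{value}|g"
    if tags:
        line += "|#" + ",".join(tags)
    return line


def send_gauge(name, value, tags=None, host="127.0.0.1", port=8125):
    """Send a single gauge over UDP to an agent assumed to listen on host:port.

    UDP is fire-and-forget: nothing is read back, and the call does not
    confirm that an agent actually received the datagram.
    """
    payload = format_gauge(name, value, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return len(payload)  # bytes handed to the local network stack


if __name__ == "__main__":
    # Hypothetical sample value and tag, for illustration only.
    print(format_gauge("flink.taskmanager.Status.JVM.CPU.Load", 0.42, ["host:tm-1"]))
```

Each metric is one short text line sent to the agent on the same host, so none of this traffic crosses the NAT gateway directly; forwarding to Datadog is left to the agent.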
Here are the outgoing traffic numbers measured at the NAT gateway between the cluster and the outside world, before and after this change (all other things being equal):
We're talking about 850 MB in 5 min, so roughly 10 GB/h of overhead here. That kind of traffic is not free on AWS...
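The conversion behind that figure, as a quick sketch. It assumes the 5-minute sample is representative of a steady rate and uses decimal units (1 GB = 1000 MB); the function name `gb_per_hour` is just for illustration.

```python
def gb_per_hour(megabytes, minutes):
    """Extrapolate a traffic sample to GB per hour (1 GB = 1000 MB)."""
    samples_per_hour = 60 / minutes
    return megabytes * samples_per_hour / 1000


hourly = gb_per_hour(850, 5)  # 850 MB * 12 samples per hour = 10.2 GB/h
daily = hourly * 24           # assumes the same rate around the clock

if __name__ == "__main__":
    print(f"{hourly:.1f} GB/h, {daily:.1f} GB/day")
```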
Here is the change on `flink.taskmanager.Status.JVM.CPU.Load` (over the whole cluster)
Reporting telemetry in JSON over HTTP has a HUGE overhead.
So I would strongly advocate deprecating this reporter and recommending that users switch to a DogStatsD-based implementation. One already exists (https://github.com/aroch/flink-metrics-dogstatsd, not tested). On our side, we developed our own, which we can share if requested.