[FLINK-32242] Datadog HTTP Reporter produces huge outgoing traffic and CPU overhead

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.15.2
Fix Version/s: None
Component/s: Runtime / Metrics
Labels: None
Environment: Flink 1.15.2, AWS EMR.

Description

We're running a relatively small Flink cluster (7 TaskManagers × 8 cores) and use Datadog for telemetry.

The numbers for outgoing traffic didn't add up: Kafka producers, task activity, and host system metrics together did not account for what we observed. After investigation, we discovered that the extra traffic was generated by the DatadogHttpReporter.

We switched the reporter to an implementation that uses the Java DogStatsD client and reports to a Datadog agent on each host.
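For readers unfamiliar with the DogStatsD path, here is a minimal sketch of what reporting a gauge to a local agent looks like at the wire level. It is not our reporter and does not use the Java DogStatsD client; it uses only `java.net` to send one datagram in the DogStatsD text format (`name:value|g`). The class name `UdpGaugeSketch`, the value `0.42`, and the assumption that the agent listens on UDP port 8125 on the loopback address are illustrative.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class UdpGaugeSketch {

    /** Encodes a gauge in the DogStatsD text format: name:value|g */
    static byte[] encodeGauge(String name, double value) {
        return (name + ":" + value + "|g").getBytes(StandardCharsets.UTF_8);
    }

    /** Sends one gauge as a single UDP datagram (fire-and-forget, no response is read). */
    static void sendGauge(DatagramSocket socket, InetAddress agent, int port,
                          String name, double value) throws IOException {
        byte[] payload = encodeGauge(name, value);
        socket.send(new DatagramPacket(payload, payload.length, agent, port));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a Datadog agent listens for DogStatsD on localhost:8125.
        try (DatagramSocket socket = new DatagramSocket()) {
            sendGauge(socket, InetAddress.getLoopbackAddress(), 8125,
                      "flink.taskmanager.Status.JVM.CPU.Load", 0.42);
        }
    }
}
```

With this layout the metrics leave the host only through the agent, not through a per-TaskManager HTTP call to the Datadog API.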

Here are some outgoing traffic numbers, measured at a NAT gateway between the cluster and the outside world, before and after this change (all other things being equal; see the attached screenshots):

We're talking about 850 MB in 5 minutes, so roughly 10 GB/h of overhead. That kind of traffic is not free on AWS.
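The extrapolation from the 5-minute measurement to an hourly rate can be checked with a few lines. This is only the arithmetic behind the figures above; the class and method names are made up for illustration, and the daily figure simply assumes the rate stays constant.

```java
import java.util.Locale;

public class TrafficEstimate {

    /** Extrapolates a volume measured over a window (in minutes) to one hour. */
    static double perHourMb(double measuredMb, double windowMinutes) {
        return measuredMb * (60.0 / windowMinutes);
    }

    public static void main(String[] args) {
        double hourlyMb = perHourMb(850, 5);   // 850 MB every 5 minutes, 12 windows per hour
        double hourlyGb = hourlyMb / 1000.0;   // decimal gigabytes
        double dailyGb = hourlyGb * 24;        // assumes a constant rate over the day
        System.out.println(String.format(Locale.ROOT,
                "%.1f GB/h, about %.0f GB/day", hourlyGb, dailyGb));
    }
}
```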

Here is the effect of the change on `flink.taskmanager.Status.JVM.CPU.Load` (over the whole cluster; see the attached screenshots):

Reporting telemetry as JSON over HTTP has a HUGE overhead.
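To give a rough feel for the difference in encoding, the sketch below compares the byte size of one gauge written as a JSON series entry with the same gauge written as a DogStatsD datagram. The JSON shape approximates the Datadog v1 series payload and is an assumption, not the exact output of DatadogHttpReporter; the host name, tag, timestamp, and value are invented. It ignores batching, compression, HTTP headers, and TLS, so it illustrates the per-metric encoding only and is not a measurement of our traffic.

```java
import java.nio.charset.StandardCharsets;

public class PayloadSizeSketch {

    /** One gauge as a JSON series entry (shape assumed, modeled on the Datadog v1 series API). */
    static String jsonSeries(String metric, double value, long epochSeconds,
                             String host, String tag) {
        return "{\"series\":[{\"metric\":\"" + metric + "\","
                + "\"points\":[[" + epochSeconds + "," + value + "]],"
                + "\"type\":\"gauge\",\"host\":\"" + host + "\","
                + "\"tags\":[\"" + tag + "\"]}]}";
    }

    /** The same gauge as a DogStatsD datagram: name:value|g|#tag */
    static String dogstatsdGauge(String metric, double value, String tag) {
        return metric + ":" + value + "|g|#" + tag;
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String metric = "flink.taskmanager.Status.JVM.CPU.Load";
        String tag = "tm_id:example";          // hypothetical tag
        String host = "ip-10-0-0-1";           // hypothetical host name

        String json = jsonSeries(metric, 0.42, 1685635200L, host, tag);
        String statsd = dogstatsdGauge(metric, 0.42, tag);

        System.out.println("JSON bytes:      " + utf8Length(json));
        System.out.println("DogStatsD bytes: " + utf8Length(statsd));
    }
}
```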

So I would strongly advocate deprecating this reporter and recommending that users switch to a DogStatsD-based implementation. One exists (https://github.com/aroch/flink-metrics-dogstatsd, not tested). On our side, we developed our own, which we can share if requested.

Attachments

- image-2023-06-01-17-54-45-900.png (01/Jun/23 15:54, 27 kB, Mathieu DESPRIEE)
- image-2023-06-01-17-56-50-809.png (01/Jun/23 15:56, 22 kB, Mathieu DESPRIEE)

People

Assignee: Unassigned

Reporter: Mathieu DESPRIEE

Votes: 0

Watchers: 3

Dates

Created: 01/Jun/23 16:29

Updated: 01/Jun/23 16:35