Flink / FLINK-32242

Datadog HTTP Reporter produces huge outgoing traffic and CPU overhead

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.15.2
    • Fix Version/s: None
    • Component/s: Runtime / Metrics
    • Labels: None
    • Environment: Flink 1.15.2, AWS EMR.

    Description

      We're running a relatively small Flink cluster (7 task managers with 8 cores each) and use Datadog for telemetry.

      The outgoing traffic figures did not add up: Kafka producers, task activity, and host system metrics together could not account for the total. After investigation, we found that the unexplained traffic was generated by the DatadogHttpReporter.

      We switched to a reporter implementation based on the Java DogStatsD client, which reports to a Datadog agent running on each host.

      Here is the outgoing traffic measured at a NAT gateway between the cluster and the outside world, before and after this change, all other things being equal (see the attached screenshots):

      That is about 850 MB in 5 minutes, or roughly 10 GB/h of overhead. That kind of traffic is not free on AWS.
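The extrapolation above can be checked with a short calculation. This is only a sketch of the arithmetic: the 850 MB and 5 minute figures come from the report, while the decimal unit convention (1 GB = 1000 MB) and the daily projection are assumptions added for illustration.

```python
# Extrapolate the overhead measured at the NAT gateway to longer windows.
MEASURED_MB = 850      # figure quoted in the report
WINDOW_MINUTES = 5     # measurement window quoted in the report


def hourly_gb(measured_mb: float, window_minutes: float) -> float:
    """Scale a measurement window up to one hour (assumes 1 GB = 1000 MB)."""
    windows_per_hour = 60 / window_minutes
    return measured_mb * windows_per_hour / 1000


per_hour = hourly_gb(MEASURED_MB, WINDOW_MINUTES)
per_day = per_hour * 24  # assumes the rate stays constant all day

print(f"{per_hour:.1f} GB/h, {per_day:.1f} GB/day")
```

The hourly figure matches the "10 GB/h" quoted in the description; the daily figure only holds if the reporting rate is constant.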

      Here is the effect of the change on `flink.taskmanager.Status.JVM.CPU.Load` (over the whole cluster):

      Reporting telemetry as JSON over HTTP has a huge overhead.
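To give a rough sense of where part of that overhead may come from, the sketch below serializes one gauge sample in a JSON shape modeled on Datadog's v1 series payload and as a DogStatsD datagram, then compares the byte counts. It is an illustration under stated assumptions, not the reporter's actual code: the host name, tag values, and sample value are hypothetical, and it ignores batching, HTTP headers, TLS, and whatever the local agent does when it forwards data.

```python
import json


def json_series_payload(name, value, timestamp, host, tags):
    """One gauge sample in a JSON body modeled on Datadog's v1 series format."""
    body = {
        "series": [
            {
                "metric": name,
                "points": [[timestamp, value]],
                "type": "gauge",
                "host": host,
                "tags": tags,
            }
        ]
    }
    return json.dumps(body).encode("utf-8")


def dogstatsd_datagram(name, value, tags):
    """The same sample as a DogStatsD gauge datagram: name:value|g|#tags."""
    return f"{name}:{value}|g|#{','.join(tags)}".encode("utf-8")


# Hypothetical sample: the metric name is from the report, the rest is made up.
metric = "flink.taskmanager.Status.JVM.CPU.Load"
tags = ["tm_id:example-tm", "env:example"]

json_bytes = json_series_payload(metric, 0.42, 1685635200, "example-host", tags)
statsd_bytes = dogstatsd_datagram(metric, 0.42, tags)

print(len(json_bytes), "bytes as JSON,", len(statsd_bytes), "bytes as DogStatsD")
```

Serialized size is only one factor; the CPU cost of building and sending the HTTP requests is not modeled here.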

      I would therefore strongly advocate deprecating this reporter and recommending that users switch to a DogStatsD-based implementation. One exists (https://github.com/aroch/flink-metrics-dogstatsd, which we have not tested). On our side, we developed our own, which we can share on request.

      Attachments

        1. image-2023-06-01-17-54-45-900.png
          27 kB
          Mathieu DESPRIEE
        2. image-2023-06-01-17-56-50-809.png
          22 kB
          Mathieu DESPRIEE

          People

            Assignee: Unassigned
            Reporter: Mathieu DESPRIEE (mathieude)
            Votes: 0
            Watchers: 3
