FLINK-32242

Datadog HTTP Reporter produces huge outgoing traffic and CPU overhead


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.15.2
    • Fix Version/s: None
    • Component/s: Runtime / Metrics
    • Labels: None
    • Environment: Flink 1.15.2, AWS EMR.

    Description

      We're running a relatively small Flink cluster (7 task managers × 8 cores) and use Datadog for telemetry.

      The numbers for outgoing traffic didn't add up: Kafka producers, task activity, and host system metrics together did not account for what we observed. After investigation, we discovered that the extra traffic was generated by the DatadogHttpReporter.

      We switched the reporter to an implementation based on the Java DogStatsD client, reporting to a Datadog agent on each host.
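
      For illustration, here is a minimal sketch of what reporting to a local agent over DogStatsD amounts to on the wire. It is not our reporter and does not use the Java DogStatsD client or Flink's `MetricReporter` interface; the class `DogStatsdSketch`, the gauge value, and the `host:tm-1` tag are made up for the example. It assumes an agent listening on its default DogStatsD port (UDP 8125) on the same host.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Flink or of the Datadog client library.
class DogStatsdSketch {

    // Builds one DogStatsD gauge datagram: <name>:<value>|g|#<tag>,<tag>
    static String formatGauge(String name, double value, String... tags) {
        StringBuilder sb = new StringBuilder(name).append(':').append(value).append("|g");
        if (tags.length > 0) {
            sb.append("|#").append(String.join(",", tags));
        }
        return sb.toString();
    }

    // Sends the datagram over UDP; fire-and-forget, no response is read.
    static void send(String datagram, String host, int port) throws Exception {
        byte[] payload = datagram.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress address = InetAddress.getByName(host);
            socket.send(new DatagramPacket(payload, payload.length, address, port));
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed values: the gauge reading and the tag are placeholders.
        String line = formatGauge("flink.taskmanager.Status.JVM.CPU.Load", 0.42, "host:tm-1");
        send(line, "localhost", 8125); // 8125 is the agent's default DogStatsD port
        System.out.println(line);
    }
}
```

      Each metric is a short text line sent to the loopback interface, and forwarding to Datadog is left to the agent.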

      Here are outgoing-traffic numbers measured at a NAT gateway between the cluster and the outside world, before and after this change (all other things being equal):

      We're talking about 850 MB in 5 minutes, so roughly 10 GB/h of overhead. That kind of traffic is not free on AWS.
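
      The extrapolation from the 5-minute window to an hourly figure can be checked with a few lines. This is only the arithmetic on the numbers quoted above; the class and method names are made up, and 1 GB is taken as 1000 MB.

```java
// Hypothetical helper for illustration; only reproduces the arithmetic in the report.
class TrafficEstimate {

    // Extrapolates a volume observed over a window (in minutes) to GB per hour.
    static double gigabytesPerHour(double megabytes, double windowMinutes) {
        double windowsPerHour = 60.0 / windowMinutes;
        return megabytes * windowsPerHour / 1000.0; // 1 GB = 1000 MB here
    }

    public static void main(String[] args) {
        // 850 MB observed in a 5-minute window at the NAT gateway
        System.out.println(gigabytesPerHour(850, 5) + " GB/h"); // prints "10.2 GB/h"
    }
}
```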

      Here is the change in `flink.taskmanager.Status.JVM.CPU.Load` (over the whole cluster):

      Reporting telemetry as JSON over HTTP has a huge overhead.
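
      To give a rough feel for the encoding difference, the sketch below compares the byte size of a single gauge written as a JSON body in the shape of Datadog's v1 "series" API with the same gauge as a DogStatsD line. It does not measure the reporter: the exact payload Flink sends is an assumption here, the metric value, timestamp, and host name are placeholders, and HTTP headers, TLS, batching, and JSON serialization CPU cost are all left out.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical comparison for illustration; payload shapes are simplified assumptions.
class PayloadSizeSketch {

    // One gauge in the shape of a v1 "series" JSON body (assumed, simplified).
    static String jsonSeries(String metric, long epochSeconds, double value, String host) {
        return "{\"series\":[{\"metric\":\"" + metric + "\","
                + "\"points\":[[" + epochSeconds + "," + value + "]],"
                + "\"type\":\"gauge\",\"host\":\"" + host + "\"}]}";
    }

    // The same gauge as a single DogStatsD line.
    static String statsdLine(String metric, double value, String host) {
        return metric + ":" + value + "|g|#host:" + host;
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String metric = "flink.taskmanager.Status.JVM.CPU.Load";
        String json = jsonSeries(metric, 1700000000L, 0.42, "tm-1");
        String line = statsdLine(metric, 0.42, "tm-1");
        System.out.println("JSON body:      " + utf8Length(json) + " bytes");
        System.out.println("DogStatsD line: " + utf8Length(line) + " bytes");
    }
}
```

      The per-metric difference is only part of the picture: with DogStatsD the bytes go to an agent on the same host rather than through the NAT gateway.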

      So I would strongly advocate deprecating this reporter and recommending that users switch to a DogStatsD-based implementation. One exists (https://github.com/aroch/flink-metrics-dogstatsd, not tested). On our side, we developed our own, which we can share if requested.

          People

            Assignee: Unassigned
            Reporter: Mathieu DESPRIEE (mathieude)