Uploaded image for project: 'Eagle (Retired)'
  1. Eagle (Retired)
  2. EAGLE-1024

Monitor jobs with high RPC throughput

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Done
    • v0.5.0
    • v0.5.0
    • None
    • None

    Description

      We've identified some jobs with high RPC throughput which causes the NN heavy RPC overhead. These jobs has requested extremely large HDFS operations in a very short window (2 mins).

      So we tend to capture those jobs with:
      a) the job has very large RPC throughput, using the job total HDFS ops/the job duration, if the throughput is larger than 1000
      b) and if the HDFS ops per task is larger than 25
      Then send out the alert out. Later, we will notify the users to optimize their jobs.

      Attachments

        Issue Links

          Activity

            People

              qingwzhao Qingwen Zhao
              qingwzhao Qingwen Zhao
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Slack

                  Issue deployment