Details
- Improvement
- Status: Resolved
- Major
- Resolution: Done
- v0.5.0
- None
- None
Description
We've identified some jobs whose high RPC throughput puts heavy RPC overhead on the NameNode (NN). These jobs issued an extremely large number of HDFS operations within a very short window (2 mins).
To capture such jobs, we flag a job when both of the following hold:
a) its RPC throughput, computed as the job's total HDFS ops divided by the job duration, is larger than 1000
b) its HDFS ops per task is larger than 25
We then send out an alert. Later, we will notify the users so they can optimize their jobs.
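The two conditions can be sketched as a small predicate. This is only an illustration, not the actual implementation: the `JobStats` type, its field names, and the function name are hypothetical, and the duration is assumed to be in seconds because the description does not state the time unit behind the 1000 threshold.

```python
from dataclasses import dataclass

# Thresholds taken from the description. The time unit for the
# throughput threshold is not stated there; seconds is an assumption.
THROUGHPUT_THRESHOLD = 1000
OPS_PER_TASK_THRESHOLD = 25


@dataclass
class JobStats:
    """Hypothetical per-job counters used only for this sketch."""
    job_id: str
    total_hdfs_ops: int
    duration_seconds: float  # assumed unit
    num_tasks: int


def is_rpc_heavy(job: JobStats) -> bool:
    """Return True when both conditions (a) and (b) hold for the job."""
    # Guard against division by zero for jobs with no duration or no tasks.
    if job.duration_seconds <= 0 or job.num_tasks <= 0:
        return False
    throughput = job.total_hdfs_ops / job.duration_seconds  # condition (a)
    ops_per_task = job.total_hdfs_ops / job.num_tasks       # condition (b)
    return (throughput > THROUGHPUT_THRESHOLD
            and ops_per_task > OPS_PER_TASK_THRESHOLD)
```

Requiring both conditions means a job with many lightweight tasks, which has high total throughput but few ops per task, would not trigger the alert under this sketch.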