Details
Improvement
Status: Resolved
Minor
Resolution: Incomplete
2.2.0
None
Description
The current set of metrics in the external shuffle service is fairly limited. To debug shuffle service failures, it would be good to have more information about the state of the service. As a first cut, the following metrics seem important:
1. The amount of heap memory used by the External Shuffle Service process
2. The amount of direct buffer (off-heap) memory allocated to the External Shuffle Service. In the external shuffle service, Netty uses off-heap memory. Monitoring its usage can help in allocating appropriate resources and can also be used to raise alarms when the allocated memory exceeds a threshold.
3. The queue length in Netty event loops. Chunk Fetch Requests or Open Block requests can be dropped as a result of Netty queue overflows, resulting in a FetchFailure. Hard data on queue size can help in attributing the cause of such failures.
Please let me know of other metrics (from the Shuffle Service perspective) that would be good to have.
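
To make the three proposed metrics concrete, here is a minimal sketch in Java that reads them using only standard JMX beans. It is an illustration under stated assumptions, not the shuffle service's actual implementation: the class name `ShuffleServiceMetricsSketch`, the metric names, and the `pendingTasks` hook are all hypothetical. The JVM's "direct" buffer pool may not capture all off-heap memory that Netty allocates, so Netty's own allocator counters might be needed in practice. For the queue length, Netty 4.x exposes `pendingTasks()` on `SingleThreadEventExecutor`, which could plausibly feed the hook below.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntSupplier;

// Hypothetical helper; none of these names come from the shuffle service code.
public class ShuffleServiceMetricsSketch {

    /** Metric 1: heap bytes currently used by this JVM process. */
    public static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /**
     * Metric 2: bytes held by the JVM's "direct" buffer pool,
     * or -1 if the JVM does not expose such a pool.
     */
    public static long directMemoryUsedBytes() {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    /**
     * Collects one snapshot of all three metrics.
     * pendingTasks is an assumed hook for metric 3: the caller supplies
     * the number of tasks queued in the event loops.
     */
    public static Map<String, Long> snapshot(IntSupplier pendingTasks) {
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("heapUsedBytes", heapUsedBytes());
        metrics.put("directMemoryUsedBytes", directMemoryUsedBytes());
        metrics.put("eventLoopPendingTasks", (long) pendingTasks.getAsInt());
        return metrics;
    }

    public static void main(String[] args) {
        // Stand-in value of 0; a real service would sum pending tasks
        // across its event loops instead.
        System.out.println(snapshot(() -> 0));
    }
}
```

A snapshot like this could be polled periodically and compared against a threshold to raise the alarms mentioned in item 2.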