Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Issue 2:
I am working on the Celeborn Alert SOP, specifically on how to identify the applications with the highest usage.
With this expression, I cannot find the applicationId that wrote 140TB of shuffle data.
```
topk(5, sum by (applicationId) (metrics_diskBytesWritten_Value{role="worker", applicationId=~"^application_.*"}))
```
<img width="1491" alt="image" src="https://github.com/user-attachments/assets/c7caa5d1-c99c-4062-8c78-e7bd8ed5c3db">
But with this expression, the shuffle size matches.
```
topk(5, sum by (name) (metrics_diskBytesWritten_Value{role="worker", applicationId=""}))
```
<img width="1490" alt="image" src="https://github.com/user-attachments/assets/da7f53c5-cc75-4856-97f8-fb12ec80addc">
Please note that the Celeborn cluster had not taken any traffic yet, and only one test application was running at that time.
<img width="937" alt="image" src="https://github.com/user-attachments/assets/8d2a1833-b982-487f-988a-b1b4db1764bd">
Per the metrics, it seems that some `metrics_diskBytesWritten_Value` series carrying the applicationId label were lost.
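
One possible way to quantify the gap is to subtract the per-application total from the total of the unlabeled series. This is only a sketch. It assumes the series with `applicationId=""` are the per-disk totals, and that the labeled series should add up to the same amount. Neither assumption is confirmed here.
```
# Bytes written that are not attributed to any application (assumption: see above).
# `or vector(0)` keeps the result defined when no labeled series exist at all.
sum(metrics_diskBytesWritten_Value{role="worker", applicationId=""})
  -
(sum(metrics_diskBytesWritten_Value{role="worker", applicationId=~"^application_.*"}) or vector(0))
```
A result near zero would suggest the labeled series are complete. A large positive value would be consistent with the missing 140TB.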