Affects Version/s: 3.0.0
Fix Version/s: None
We run Spark on YARN and deploy the Spark external shuffle service as a YARN NodeManager auxiliary service. The external shuffle service is shared by all Spark applications.
SPARK-24355 reserves threads for handling non-ChunkFetchRequest messages. SPARK-21501 limits the memory used by the Guava cache to avoid an OOM in the shuffle service that could crash the NodeManager. However, an application can still generate a large number of shuffle blocks and heavily degrade performance on some shuffle servers. When this abusive behavior happens, it can further degrade performance for other applications that happen to use the same shuffle servers. We have been seeing issues like this in our cluster, but there is no way for us to figure out which application is abusing the shuffle service. SPARK-18364 exposes shuffle service metrics to the Hadoop Metrics System. It would be better if we also had the following metrics, including breakdowns by application ID:
1. shuffle server on-heap memory consumption for caching shuffle indexes
2. breakdown of shuffle indexes caching memory consumption by local executors
We can generate these metrics in ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData, which triggers the cache load. We can roughly derive the values from the shuffle index file size when an entry is put into the cache and when it is moved out of the cache.
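A minimal sketch of what this per-application accounting could look like, using only JDK classes (the class and method names here are hypothetical, not existing Spark APIs; in practice the load/evict callbacks would be wired to the Guava cache's loader and removal listener):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: track shuffle-index cache memory per application.
// cacheLoaded(...) would be called when getSortBasedShuffleBlockData triggers
// a cache load; cacheEvicted(...) would be called from the cache's removal
// listener. Sizes are approximated by the shuffle index file size.
public class ShuffleIndexCacheMetrics {
    private final Map<String, LongAdder> bytesByApp = new ConcurrentHashMap<>();
    private final LongAdder totalBytes = new LongAdder();

    // Record an index file being loaded into the cache (size in bytes).
    public void cacheLoaded(String appId, long indexFileBytes) {
        bytesByApp.computeIfAbsent(appId, k -> new LongAdder()).add(indexFileBytes);
        totalBytes.add(indexFileBytes);
    }

    // Record an index file being evicted from the cache.
    public void cacheEvicted(String appId, long indexFileBytes) {
        LongAdder adder = bytesByApp.get(appId);
        if (adder != null) {
            adder.add(-indexFileBytes);
        }
        totalBytes.add(-indexFileBytes);
    }

    // Total on-heap bytes currently held for cached shuffle indexes.
    public long totalCachedBytes() {
        return totalBytes.sum();
    }

    // Bytes currently cached on behalf of one application's executors.
    public long cachedBytesForApp(String appId) {
        LongAdder adder = bytesByApp.get(appId);
        return adder == null ? 0L : adder.sum();
    }
}
```

The total covers metric 1 above, while the per-application map covers metric 2.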
3. shuffle server load for shuffle block fetch requests
4. breakdown of shuffle server block fetch requests load by remote executors
We can generate these metrics in ExternalShuffleBlockHandler-->handleMessage when a new OpenBlocks message is received.
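A corresponding sketch for the fetch-load side, again with hypothetical names and only JDK classes; recordOpenBlocks(...) would be invoked from handleMessage with the application ID and the number of blocks in the OpenBlocks message:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: count block-fetch load per remote application.
// One OpenBlocks message can request many blocks, so both the message
// count and the block count are tracked.
public class ShuffleFetchMetrics {
    private final Map<String, LongAdder> openBlocksByApp = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> blocksRequestedByApp = new ConcurrentHashMap<>();

    // Called when an OpenBlocks message arrives requesting numBlocks blocks.
    public void recordOpenBlocks(String appId, int numBlocks) {
        openBlocksByApp.computeIfAbsent(appId, k -> new LongAdder()).increment();
        blocksRequestedByApp.computeIfAbsent(appId, k -> new LongAdder()).add(numBlocks);
    }

    // Number of OpenBlocks messages received from one application.
    public long openBlocksCount(String appId) {
        LongAdder adder = openBlocksByApp.get(appId);
        return adder == null ? 0L : adder.sum();
    }

    // Number of blocks requested in total by one application.
    public long blocksRequested(String appId) {
        LongAdder adder = blocksRequestedByApp.get(appId);
        return adder == null ? 0L : adder.sum();
    }
}
```

Summing across applications gives the overall server load (metric 3); the per-application counters give the breakdown (metric 4).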
This is open for discussion; there may be more metrics that could influence overall shuffle service performance.
We can print the per-application metrics to the log, since application IDs are dynamic and it is hard to define a fixed metric key with a numerical value for this kind of metric.
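One way the per-application values could be rendered into a single log line (a sketch only; the method and metric names here are illustrative, not part of any existing API):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: format a per-application metric map as one log line,
// since application IDs are dynamic and cannot serve as fixed metric keys.
public class PerAppMetricsLogger {
    public static String format(String metricName, Map<String, Long> valuesByApp) {
        // Sort by application ID so repeated log lines are stable and easy to diff.
        return new TreeMap<>(valuesByApp).entrySet().stream()
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(", ", metricName + " {", "}"));
    }
}
```

For example, `format("indexCacheBytes", map)` with entries for app-1 and app-2 yields a line like `indexCacheBytes {app-1=1024, app-2=2048}`, which can be emitted periodically through the shuffle service's logger.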