We run Spark on YARN, and deploy Spark external shuffle service as part of YARN NM aux service.
One issue we saw with Spark external shuffle service is the various timeout experienced by the clients on either registering executor with local shuffle server or establish connection to remote shuffle server.
Example of a timeout for establishing connection with remote shuffle server:
Example of a timeout for registering executor with local shuffle server:
While patches such as
SPARK-20640 and config parameters such as spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when spark.authenticate is set to true) could help to alleviate this type of problems, it does not solve the fundamental issue.
We have observed that, when the shuffle workload gets very busy in peak hours, the client requests could timeout even after configuring these parameters to very high values. Further investigating this issue revealed the following issue:
Right now, the default server side netty handler threads is 2 * # cores, and can be further configured with parameter spark.shuffle.io.serverThreads.
In order to process a client request, it would require one available server netty handler thread.
However, when the server netty handler threads start to process ChunkFetchRequests, they will be blocked on disk I/O, mostly due to disk contentions from the random read operations initiated by all the ChunkFetchRequests received from clients.
As a result, when the shuffle server is serving many concurrent ChunkFetchRequests, the server side netty handler threads could all be blocked on reading shuffle files, thus leaving no handler thread available to process other types of requests which should all be very quick to process.
This issue could potentially be fixed by limiting the number of netty handler threads that could get blocked when processing ChunkFetchRequest. We have a patch to do this by using a separate EventLoopGroup with a dedicated ChannelHandler to process ChunkFetchRequest. This enables shuffle server to reserve netty handler threads for non-ChunkFetchRequest, thus enabling consistent processing time for these requests which are fast to process. After deploying the patch in our infrastructure, we no longer see timeout issues with either executor registration with local shuffle server or shuffle client establishing connection with remote shuffle server.
Will post the patch soon, and want to gather feedbacks from the community.