Spark's network layer does not implement read timeouts, which may lead to stalls during shuffle: if a remote shuffle server stalls while responding to a shuffle block fetch request but does not close the socket, then the job may block until an OS-level socket timeout occurs.
I think that we can fix this using Netty's ReadTimeoutHandler (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler). The tricky part of working on this will be figuring out the right place to add the handler and ensuring that we don't introduce performance issues by not re-using sockets.
Quoting from that linked StackOverflow question:
Note that the ReadTimeoutHandler is also unaware of whether you have sent a request - it only cares whether data has been read from the socket. If your connection is persistent, and you only want read timeouts to fire when a request has been sent, you'll need to build a request / response aware timeout handler.
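If we want to avoid tearing down connections between shuffles, then we may have to build a request/response-aware timeout handler like the one described in that quote, so that an idle but healthy connection is not closed just because no fetch is in flight.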
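
To make the idea concrete, here is a rough sketch of the bookkeeping such a request/response-aware timeout might need. It is plain Java and deliberately independent of Netty. The class name RequestAwareTimeout and all of its methods are hypothetical, not existing Spark or Netty APIs, and timestamps are passed in by the caller so the logic is easy to test.

```java
/**
 * Hypothetical sketch (not Spark or Netty code): tracks whether a persistent
 * connection should be considered timed out. The timeout can only fire while
 * at least one request is outstanding, so idle connections are left alone.
 */
final class RequestAwareTimeout {
    private final long timeoutMillis;
    private int outstandingRequests = 0;
    private long lastActivityMillis = 0;

    RequestAwareTimeout(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    /** Call when a fetch request has been written to the socket. */
    synchronized void onRequestSent(long nowMillis) {
        if (outstandingRequests == 0) {
            // The connection was idle; start the clock from this request.
            lastActivityMillis = nowMillis;
        }
        outstandingRequests++;
    }

    /** Call whenever any bytes are read from the socket. */
    synchronized void onRead(long nowMillis) {
        lastActivityMillis = nowMillis;
    }

    /** Call when a complete response has been received. */
    synchronized void onResponseCompleted(long nowMillis) {
        if (outstandingRequests > 0) {
            outstandingRequests--;
        }
        lastActivityMillis = nowMillis;
    }

    /** True only if a request is in flight and nothing has been read for too long. */
    synchronized boolean isTimedOut(long nowMillis) {
        return outstandingRequests > 0
            && nowMillis - lastActivityMillis > timeoutMillis;
    }
}
```

In a real change, a channel handler would presumably call these hooks from its write and read paths and close the connection when isTimedOut returns true. Where that handler should live in the pipeline is still the open question above.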