Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Description
In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions.
If the connection attempt times out, the socket gets closed, and from [1] we get a ClosedChannelException. We should check whether the socket was closed because of a timeout and, if so, open a new socket and retry the connection.
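To make the proposal concrete, here is a minimal sketch of the retry-on-timeout idea. It is written in Python purely for illustration and is not the project's code; the function name connect_with_retry and its parameters (timeout_s, max_attempts, backoff_s, socket_factory) are hypothetical. A socket whose connect attempt timed out is closed and replaced with a fresh one rather than reused.

```python
import socket
import time


def connect_with_retry(host, port, timeout_s=5.0, max_attempts=3,
                       backoff_s=0.0, socket_factory=socket.socket):
    """Connect to (host, port); on a timeout, discard the socket and retry.

    Hypothetical helper for illustration only. Errors other than a
    timeout are not retried and propagate to the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Always start from a brand-new socket; a timed-out one is unusable.
        sock = socket_factory(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout_s)
        try:
            sock.connect((host, port))
            return sock
        except socket.timeout as exc:
            # The attempt timed out: close this socket and try again.
            last_error = exc
            sock.close()
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)
        except OSError:
            # Any other failure (e.g. connection refused) is not retried.
            sock.close()
            raise
    raise last_error
```

The caller gets back a connected socket, or the last timeout error once all attempts are used up. Whether retrying at this layer is preferable to tuning the OS is a separate question.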
FWIW, I was able to work around the problem by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries)
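As a rough illustration of why raising tcp_syn_retries helps, the sketch below estimates how long a connect attempt waits before the kernel gives up. It assumes an initial retransmission timeout of 1 second that doubles after each unanswered SYN and is capped at 120 seconds; those values are assumptions and actual kernels may behave differently. The function name syn_connect_timeout_s and its parameters are hypothetical.

```python
def syn_connect_timeout_s(syn_retries, initial_rto_s=1.0, rto_max_s=120.0):
    """Approximate seconds before connect() gives up.

    Assumes exponential backoff: the wait doubles after every unanswered
    SYN and never exceeds rto_max_s. These defaults are assumptions.
    """
    total = 0.0
    rto = initial_rto_s
    # One wait after the initial SYN, plus one after each retransmission.
    for _ in range(syn_retries + 1):
        total += min(rto, rto_max_s)
        rto *= 2
    return total
```

Under these assumptions each extra retry lengthens the window in which a slow or overloaded peer can still answer the SYN, which is consistent with the workaround above reducing the timeouts.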