[SPARK-2563] Re-open sockets to handle connect timeouts - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions.

When a connection attempt times out, the socket is closed, and the code at [1] then throws a ClosedChannelException. We should check whether the socket was closed because of a timeout and, if so, open a new socket and retry the connection.
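The proposed behavior could look roughly like the sketch below. It is not Spark's connection code and not the change in the linked pull request. It is a minimal blocking illustration, and the class name ReconnectOnTimeout, the method connectWithRetry, and its parameters are all hypothetical. It assumes that a timed-out connect leaves the channel unusable, so every attempt opens a fresh SocketChannel, and that only timeouts are worth retrying.

```java
import java.io.IOException;
import java.net.SocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SocketChannel;

// Hypothetical helper for illustration only; not part of Spark.
final class ReconnectOnTimeout {

    /**
     * Connects to the address, retrying only when the attempt times out.
     * A new channel is opened for each attempt because a channel whose
     * connect timed out cannot be reused.
     */
    static SocketChannel connectWithRetry(SocketAddress address,
                                          int timeoutMillis,
                                          int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            SocketChannel channel = SocketChannel.open();
            boolean connected = false;
            try {
                // Blocking connect with a timeout via the channel's socket view.
                channel.socket().connect(address, timeoutMillis);
                connected = true;
                return channel;
            } catch (SocketTimeoutException e) {
                // Timed out: remember the failure and try again with a new channel.
                lastTimeout = e;
            } finally {
                // Release the channel on any failure, including non-timeout errors,
                // which are not retried and propagate to the caller.
                if (!connected) {
                    channel.close();
                }
            }
        }
        // Every attempt timed out; surface the last timeout to the caller.
        throw lastTimeout;
    }
}
```

A caller would use the returned channel as usual and close it when done. Errors other than a timeout (for example, connection refused) are deliberately not retried here, since the problem described in this issue is specific to connect timeouts.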

FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries)

[1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573

Issue Links

links to

[Github] Pull Request #1471 (shivaram)

People

Assignee: Unassigned

Reporter: Shivaram Venkataraman

Votes: 0

Watchers: 2

Dates

Created: 17/Jul/14 22:56

Updated: 09/Feb/16 09:37

Resolved: 09/Feb/16 09:37