  1. Spark
  2. SPARK-2563

Re-open sockets to handle connect timeouts

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Component/s: Spark Core

    Description

      In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions.

      If the connection attempt times out, the socket is closed, and any further use of the channel throws a ClosedChannelException (see [1]). We should check whether the socket was closed because of a timeout and, if so, open a new socket and retry the connection.
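
The proposal above can be sketched roughly as follows. This is a minimal illustration, not Spark's connection code: `ReconnectingConnector`, `Opener`, `openOnce`, and `connectWithRetry` are hypothetical names, and the sketch assumes a blocking `SocketChannel` connected through its socket adaptor, which throws `SocketTimeoutException` when the connect times out. Each attempt uses a fresh channel, since a channel whose connect failed cannot be reused.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SocketChannel;

/** Hypothetical helper, not part of Spark: retries a timed-out connect on a fresh channel. */
final class ReconnectingConnector {

    /** Hypothetical hook so the connect step can be replaced (for example in tests). */
    interface Opener {
        SocketChannel open(InetSocketAddress addr, int timeoutMs) throws IOException;
    }

    /** Opens a new channel and connects it, closing the channel if the connect fails. */
    static SocketChannel openOnce(InetSocketAddress addr, int timeoutMs) throws IOException {
        SocketChannel channel = SocketChannel.open();
        try {
            // The socket adaptor's connect throws SocketTimeoutException on timeout.
            channel.socket().connect(addr, timeoutMs);
            return channel;
        } catch (IOException | RuntimeException e) {
            channel.close(); // never reuse a channel whose connect failed
            throw e;
        }
    }

    /** Tries up to maxAttempts times; only timeouts are retried, other errors propagate. */
    static SocketChannel connectWithRetry(Opener opener, InetSocketAddress addr,
                                          int timeoutMs, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return opener.open(addr, timeoutMs);
            } catch (SocketTimeoutException e) {
                lastTimeout = e; // timed out: loop and try again with a new socket
            }
        }
        throw lastTimeout; // every attempt timed out
    }
}
```

A caller would use it as `ReconnectingConnector.connectWithRetry(ReconnectingConnector::openOnce, addr, 5_000, 3)`, where the timeout and attempt count are illustrative values rather than recommendations.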

      FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries)
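
To get a feel for why raising tcp_syn_retries helps, the sketch below estimates how long the kernel keeps trying before it gives up on a connect. It assumes an initial retransmission timeout that doubles after every SYN and is capped at a maximum; the 1-second and 120-second values in the usage note are assumptions about the kernel in use, not something this report states. `SynRetryBudget` and `totalWaitSeconds` are hypothetical names.

```java
/** Hypothetical helper: estimates the kernel's total connect wait for a given tcp_syn_retries. */
final class SynRetryBudget {

    /**
     * Sums the wait after the initial SYN and after each of synRetries retransmissions,
     * assuming the timeout starts at initialRtoSeconds, doubles each time, and never
     * exceeds maxRtoSeconds.
     */
    static long totalWaitSeconds(int synRetries, long initialRtoSeconds, long maxRtoSeconds) {
        long total = 0;
        long rto = Math.min(initialRtoSeconds, maxRtoSeconds);
        // One wait per transmission: the initial SYN plus synRetries retransmissions.
        for (int sent = 0; sent <= synRetries; sent++) {
            total += rto;
            rto = Math.min(rto * 2, maxRtoSeconds); // exponential backoff with a cap
        }
        return total;
    }
}
```

For example, `SynRetryBudget.totalWaitSeconds(8, 1, 120)` gives the estimate for the setting used in the workaround under those assumed timer values.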

      [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573

          People

            Assignee: Unassigned
            Reporter: Shivaram Venkataraman (shivaram)
            Votes: 0
            Watchers: 2
