SPARK-12583

Spark shuffle fails with Mesos after 2 minutes

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.0.0
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      See user mailing list "Executor deregistered after 2mins" for more details.

      As of 1.6, the driver registers with each shuffle manager via MesosExternalShuffleClient. Once this connection is closed, the shuffle manager automatically cleans up the data associated with that driver.

      However, because the connection is idle, it is terminated while the driver is still running. A packet trace shows the shuffle manager sending a FIN packet to the driver after 120 seconds. The only way to delay this is to raise spark.shuffle.io.connectionTimeout on the shuffle manager, for example to 3600s.
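As a sketch of the workaround described above, the timeout could be raised in the configuration that the shuffle manager process reads. The file location shown here (conf/spark-defaults.conf on the host running the shuffle manager) is an assumption and is not stated in this report; the property name and the 3600s value come from the description.

```properties
# Assumed location: conf/spark-defaults.conf on the host running the shuffle manager.
# Raises the idle-connection timeout from the observed 120 seconds to one hour.
spark.shuffle.io.connectionTimeout   3600s
```

Note that this only postpones the disconnect: a driver that stays idle for longer than the configured value would presumably still lose its registration.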

      I patched MesosExternalShuffleClient (and ExternalShuffleClient), with newbie Scala skills, to create the TransportContext with closeIdleConnections set to "false". This didn't help: I hadn't done the network trace first, and the trace shows that the shuffle manager, not the driver, is the side closing the connection.
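The following toy model may help show why a client-side patch alone does not change the outcome. It is not Spark's networking code: the `Endpoint` class and `first_close` function are hypothetical names made up for illustration, and the only values taken from the report are the 120 second timeout seen in the packet trace and the 3600s workaround. The model assumes each side of the connection applies its own idle policy independently.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """One side of a connection with its own idle-timeout policy (hypothetical model)."""
    name: str
    idle_timeout_s: float         # seconds of silence this side tolerates
    close_idle_connections: bool  # whether this side closes idle connections at all


def first_close(endpoints, last_activity_s=0.0):
    """Return (time, name) for the first side to close an idle connection, or None."""
    candidates = [
        (last_activity_s + e.idle_timeout_s, e.name)
        for e in endpoints
        if e.close_idle_connections
    ]
    return min(candidates) if candidates else None


# Driver patched so that it never closes idle connections (the attempted fix).
driver = Endpoint("driver", 120.0, close_idle_connections=False)

# Shuffle manager still using the 120 second timeout seen in the packet trace.
shuffle_default = Endpoint("shuffle manager", 120.0, close_idle_connections=True)
print(first_close([driver, shuffle_default]))  # (120.0, 'shuffle manager')

# Shuffle manager with the timeout raised to 3600 seconds (the workaround).
shuffle_raised = Endpoint("shuffle manager", 3600.0, close_idle_connections=True)
print(first_close([driver, shuffle_raised]))   # (3600.0, 'shuffle manager')
```

In this model the patched driver never initiates a close, yet the connection still ends after 120 seconds because the shuffle manager applies its own timeout. Only changing the shuffle manager's setting moves the disconnect.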

            People

              Assignee: Bertrand Bossy (bbossy)
              Reporter: Adrian Bridgett (abridgett)
              Votes: 2
              Watchers: 8
