Spark / SPARK-12583

Spark shuffle fails with Mesos after 2 minutes

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.0.0
    • Component/s: Shuffle, Spark Core
    • Labels:
      None

      Description

      See the user mailing list thread "Executor deregistered after 2mins" for more details.

      As of 1.6, the driver registers with each shuffle manager via MesosExternalShuffleClient. Once this connection is closed, the shuffle manager automatically cleans up the data associated with that driver.

      However, the connection is terminated prematurely because it is idle. A packet trace shows that after 120 seconds the shuffle manager sends a FIN packet to the driver. The only way to delay this is to increase spark.shuffle.io.connectionTimeout (e.g. to 3600s) on the shuffle manager.

      I patched MesosExternalShuffleClient (and ExternalShuffleClient), with newbie Scala skills, to construct the TransportContext with closeIdleConnections set to "false", but this didn't help (I hadn't done the network trace first).
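
The packet trace suggests why a client-side change alone could not help: the side that applies the idle timeout is the one that sends the FIN. The following is a minimal sketch of that behaviour using plain java.net sockets, not Spark's Netty-based transport. The class name IdleTimeoutDemo, the method clientSeesServerClose, and the 200 ms timeout (standing in for the 120 s one) are all hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class IdleTimeoutDemo {

    /**
     * Starts a server that closes a connection after it has been idle for
     * serverIdleTimeoutMs, connects a client that never sends anything and
     * never closes on its own, and returns true if the client observed
     * end-of-stream (the server's FIN).
     */
    static boolean clientSeesServerClose(int serverIdleTimeoutMs) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread serverThread = new Thread(() -> {
                try (Socket conn = server.accept()) {
                    // Server-side idle timeout: give up if no byte arrives in time.
                    conn.setSoTimeout(serverIdleTimeoutMs);
                    try {
                        conn.getInputStream().read();
                    } catch (SocketTimeoutException idle) {
                        // Idle for too long; leaving the try block closes the socket.
                    }
                } catch (IOException ignored) {
                    // Not relevant to the illustration.
                }
            });
            serverThread.start();

            try (Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                            server.getLocalPort())) {
                // The client applies no idle timeout of its own and sends no data.
                // read() blocks until the server closes, then returns -1.
                int b = client.getInputStream().read();
                serverThread.join();
                return b == -1;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("client saw end-of-stream: " + clientSeesServerClose(200));
    }
}
```

In this sketch the client never closes the connection itself, yet it still sees the connection end once the server's timeout fires. That is consistent with the observation above that only raising the timeout on the shuffle manager delayed the disconnect.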

              People

              • Assignee: Bertrand Bossy (bbossy)
              • Reporter: Adrian Bridgett (abridgett)
              • Votes: 2
              • Watchers: 8
