Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.6.0
- None
Description
See user mailing list "Executor deregistered after 2mins" for more details.
As of 1.6, the driver registers with each shuffle manager via MesosExternalShuffleClient. Once the client disconnects, the shuffle manager automatically cleans up the data associated with that driver.
However, the connection is terminated early because it is idle. A packet trace shows the shuffle manager sending a FIN packet to the driver after 120 seconds. The only way to delay this is to raise spark.shuffle.io.connectionTimeout (for example, to 3600s) on the shuffle manager.
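As a rough sketch of that workaround, the timeout could be raised in the configuration the shuffle service loads. This assumes the service reads conf/spark-defaults.conf on the shuffle manager host, which may not match every deployment. The 3600s value is the one mentioned above, not a recommended setting.

```properties
# conf/spark-defaults.conf on the host running the external shuffle service
# (assumption: the service picks up its settings from this file)
# Keep idle connections open for an hour before they are closed.
spark.shuffle.io.connectionTimeout  3600s
```

The shuffle service would need a restart to pick up the change, and this only postpones the idle disconnect rather than preventing it.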
I patched MesosExternalShuffleClient (and ExternalShuffleClient), with newbie Scala skills, to pass closeIdleConnections as "false" when creating the TransportContext. This didn't help (I hadn't done the network trace first).
Issue Links
- is duplicated by SPARK-13159 External shuffle service broken w/ Mesos (Closed)