Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- 1.4.1
- None
Description
YARN applications should, in general, be robust to NodeManager (NM) restarts. However, if you run with the external shuffle service enabled, you will observe failures like the following after an NM restart:
2015-07-22 18:30:18,212 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 5405054848584757735
java.lang.RuntimeException: Executor is not registered (appId=application_1437612356649_0008, execId=73)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockManager.getBlockData(ExternalShuffleBlockManager.java:105)
    ...
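This happens because when the NM restarts (and with it the ExternalShuffleService), ExternalShuffleBlockResolver#registerExecutor is not called again for the executors that are still running, so the restarted service treats them as unregistered.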
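To make the failure mode concrete, here is a minimal sketch of a registry that keeps executor registrations only in memory. It is a simplified stand-in, not Spark's code: the class ShuffleRegistrySketch, its fields, and the snapshot/restore helpers are hypothetical names chosen for illustration, and saving and restoring the registrations is shown only as one possible remedy, assumed here rather than taken from the issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of a shuffle service's executor registry.
// It is NOT the Spark implementation; all names here are illustrative.
final class ShuffleRegistrySketch {
    // Registrations live only in memory, keyed by "appId/execId".
    private final Map<String, String> executors = new ConcurrentHashMap<>();

    private static String key(String appId, String execId) {
        return appId + "/" + execId;
    }

    // Called once when an executor starts and announces its local directory.
    void registerExecutor(String appId, String execId, String localDir) {
        executors.put(key(appId, execId), localDir);
    }

    // Resolves a block path; fails if the executor is unknown to this instance.
    String getBlockData(String appId, String execId, String blockId) {
        String localDir = executors.get(key(appId, execId));
        if (localDir == null) {
            throw new RuntimeException(String.format(
                "Executor is not registered (appId=%s, execId=%s)", appId, execId));
        }
        return localDir + "/" + blockId;
    }

    // Assumed remedy: copy the registrations out so they can outlive a restart.
    Map<String, String> snapshot() {
        return new HashMap<>(executors);
    }

    // Assumed remedy: build a fresh instance from previously saved registrations.
    static ShuffleRegistrySketch restore(Map<String, String> saved) {
        ShuffleRegistrySketch service = new ShuffleRegistrySketch();
        service.executors.putAll(saved);
        return service;
    }

    public static void main(String[] args) {
        ShuffleRegistrySketch beforeRestart = new ShuffleRegistrySketch();
        beforeRestart.registerExecutor("app_1", "73", "/tmp/shuffle");
        System.out.println(beforeRestart.getBlockData("app_1", "73", "block_0"));

        // A restart without saved state: the new instance knows no executors.
        ShuffleRegistrySketch afterRestart = new ShuffleRegistrySketch();
        try {
            afterRestart.getBlockData("app_1", "73", "block_0");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }

        // A restart that reloads saved registrations can serve the block again.
        ShuffleRegistrySketch recovered = restore(beforeRestart.snapshot());
        System.out.println(recovered.getBlockData("app_1", "73", "block_0"));
    }
}
```

In this sketch the second lookup fails for the same reason as in the log above: the executor is still running, but the new service instance never saw its registration.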
Issue Links
- breaks: SPARK-12807 Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3 (Resolved)
- is duplicated by: SPARK-14527 Job can't finish when restart all nodemanages with using external shuffle services (Resolved)