I see this happening frequently in our prod clusters:
- EXECUTOR: CoarseGrainedExecutorBackend sends a request to register itself with the driver.
- DRIVER: Registers the executor and replies.
- EXECUTOR: ExecutorBackend receives ACK and starts creating an Executor
- DRIVER: Tries to launch a task as it knows there is a new executor. Sends a LaunchTask to this new executor.
- EXECUTOR: The Executor is not yet initialized (one cause I have seen is that it was still trying to register with the local external shuffle service). Meanwhile, it receives a `LaunchTask` and kills itself because the Executor is not initialized.
The driver assumes that an Executor is ready to accept tasks as soon as it is registered, but that's not true.
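The sequence above can be sketched as a toy model. This is only an illustration of the ordering problem, not Spark's actual code: `ExecutorBackendModel`, `DriverModel`, and `simulate` are hypothetical names, and the "slow init" flag stands in for whatever delays Executor creation (for example, the shuffle service registration mentioned above).

```python
from typing import Optional


class ExecutorBackendModel:
    """Hypothetical stand-in for the executor-side backend (not Spark's class)."""

    def __init__(self) -> None:
        self.executor: Optional[object] = None  # created only after init finishes
        self.alive = True

    def finish_init(self) -> None:
        # In the real system this would happen after slow setup steps complete.
        self.executor = object()

    def on_launch_task(self, task_id: int) -> str:
        if self.executor is None:
            # Mirrors the reported behavior: task arrives before init, backend exits.
            self.alive = False
            return "exit"
        return f"running task {task_id}"


class DriverModel:
    """Hypothetical driver that treats 'registered' as 'ready for tasks'."""

    def __init__(self) -> None:
        self.registered = set()

    def register(self, executor_id: str) -> None:
        self.registered.add(executor_id)

    def can_schedule_on(self, executor_id: str) -> bool:
        return executor_id in self.registered


def simulate(init_finishes_before_launch: bool) -> str:
    driver = DriverModel()
    backend = ExecutorBackendModel()

    driver.register("exec-1")          # DRIVER: registers executor and replies
    if init_finishes_before_launch:
        backend.finish_init()          # EXECUTOR: init wins the race

    assert driver.can_schedule_on("exec-1")
    return backend.on_launch_task(0)   # DRIVER: sends LaunchTask immediately


if __name__ == "__main__":
    print(simulate(init_finishes_before_launch=True))
    print(simulate(init_finishes_before_launch=False))
```

In this model the outcome depends only on whether initialization completes before the first `LaunchTask` arrives, which is the gap between "registered" and "ready" described above.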
How this affects jobs / cluster:
- We waste time and resources on these executors, but they never do any meaningful computation.
- The driver thinks the executor has started running the task, but since the Executor has killed itself, it never tells the driver (BTW: this is a separate issue which I think could be fixed independently). The driver waits for 10 mins and then declares the executor dead. This adds to the latency of the job. Plus, the failure attempt count gets bumped up for the tasks even though they were never started. For unlucky tasks, this might cause the job to fail.
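A rough back-of-the-envelope sketch of the cost for a single unlucky task follows. It assumes each lost attempt costs one full 10-minute wait (the figure observed above), that the retries for one task happen one after another, and that the task failure limit is 4 (the default of `spark.task.maxFailures`; your cluster may differ). The function name `phantom_failure_impact` is made up for illustration.

```python
def phantom_failure_impact(lost_attempts: int,
                           timeout_minutes: int = 10,
                           max_failures: int = 4) -> tuple:
    """Return (minutes spent waiting on dead executors, whether the limit is hit).

    Assumption: every lost attempt is only detected after the full timeout,
    and each one counts as a task failure even though the task never ran.
    """
    wasted_minutes = lost_attempts * timeout_minutes
    hits_failure_limit = lost_attempts >= max_failures
    return wasted_minutes, hits_failure_limit


if __name__ == "__main__":
    for attempts in range(1, 5):
        print(attempts, phantom_failure_impact(attempts))
```

Under these assumptions, a task that lands on such executors four times in a row would add 40 minutes of waiting and reach the failure limit without ever having executed.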