Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-16230

Executors self-killing after being assigned tasks while still in init

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      I see this happening frequently in our prod clusters:

      • EXECUTOR: CoarseGrainedExecutorBackend sends request to register itself to the driver.
      • DRIVER: Registers executor and replies
      • EXECUTOR: ExecutorBackend receives ACK and starts creating an Executor
      • DRIVER: Tries to launch a task as it knows there is a new executor. Sends a LaunchTask to this new executor.
      • EXECUTOR: Executor is not init'ed (one of the reasons I have seen is because it was still trying to register to local external shuffle service). Meanwhile, receives a `LaunchTask`. Kills itself as Executor is not init'ed.

      The driver assumes that Executor is ready to accept tasks as soon as it is registered but thats not true.

      How this affects jobs / cluster:

      • We waste time + resources with these executors but they don't do any meaningful computation.
      • Driver thinks that the executor has started running the task but since the Executor has self killed, it does not tell driver (BTW: this is also another issue which I think could be fixed separately). Driver waits for 10 mins and then declares the executor dead. This adds up to the latency of the job. Plus, failure attempts also gets bumped up for the tasks despite the tasks were never started. For unlucky tasks, this might cause the job failure.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                tejasp Tejas Patil
                Reporter:
                tejasp Tejas Patil
              • Votes:
                0 Vote for this issue
                Watchers:
                8 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: