SPARK-27347: Fix supervised driver retry logic when agent crashes/restarts


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1, 2.3.2, 2.4.0
    • Fix Version/s: 2.3.4, 2.4.4, 3.0.0
    • Component/s: Mesos
    • Labels: None

    Description

      Ran into scenarios where --supervised Spark jobs were retried multiple times when an agent would crash, come back, and re-register, even though those jobs had already been relaunched on a different agent.

      That is:

      • supervised driver is running on agent1
      • agent1 crashes
      • driver is relaunched on another agent as `<task-id>-retry-1`
      • agent1 comes back online and re-registers with scheduler
      • Spark relaunches the same job as `<task-id>-retry-2`
      • now two copies of the job are running simultaneously, and the first retry is effectively orphaned in ZooKeeper
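      For reference, the -retry-N task IDs in the steps above follow the scheduler's <submissionId>-retry-<count> naming convention, which is also visible in the log excerpt further down. A rough, simplified sketch of that naming (illustrative names only, not the actual MesosClusterScheduler code):

      // Simplified illustration of the retry task-id naming seen in the logs.
      // The real scheduler derives the count from persisted retry state.
      object RetryTaskId {
        private val RetrySep = "-retry-"

        // First attempt uses the bare submission id; retry N appends "-retry-N".
        def forAttempt(submissionId: String, retries: Int): String =
          if (retries <= 0) submissionId else s"$submissionId$RetrySep$retries"

        // Recover the original submission id from any attempt's task id.
        def submissionIdOf(taskId: String): String = taskId.split(RetrySep).head
      }

      // RetryTaskId.forAttempt("driver-20190115192138-0001", 2)
      //   == "driver-20190115192138-0001-retry-2"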

      This happens because, when an agent comes back and re-registers, it sends a TASK_FAILED status update for its old driver task. The previous logic would indiscriminately remove that submissionId from ZooKeeper's launchedDrivers node and add it to the retryList node.

      Then, when a new offer came in, the scheduler would launch yet another -retry task even though one was already running.
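      In other words, the status-update handler needs to ignore terminal updates that refer to a stale task attempt. A minimal sketch of that kind of guard, assuming a hypothetical launchedDrivers map keyed by submissionId (illustrative only, not the actual patch):

      import org.apache.mesos.Protos.{TaskState, TaskStatus}
      import scala.collection.mutable

      // Hypothetical stand-in for the scheduler's tracked state; the real
      // scheduler persists launched drivers and retry state in ZooKeeper.
      case class TrackedDriver(submissionId: String, currentTaskId: String)

      class RetryGuard(launchedDrivers: mutable.Map[String, TrackedDriver]) {

        private def isTerminal(state: TaskState): Boolean = state match {
          case TaskState.TASK_FAILED | TaskState.TASK_KILLED | TaskState.TASK_LOST |
               TaskState.TASK_ERROR | TaskState.TASK_FINISHED => true
          case _ => false
        }

        // Retry only if the terminal update refers to the attempt currently being
        // tracked; a late TASK_FAILED from a re-registered agent for an older
        // attempt (e.g. the pre-retry task id) is dropped instead of re-queued.
        def shouldRetry(status: TaskStatus): Boolean = {
          val taskId = status.getTaskId.getValue
          val submissionId = taskId.split("-retry-").head
          isTerminal(status.getState) &&
            launchedDrivers.get(submissionId).exists(_.currentTaskId == taskId)
        }
      }

      The real change lives in MesosClusterScheduler's status-update handling; this sketch only illustrates the idea of checking the current attempt before scheduling another retry.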

      A sample log looks something like this:

      19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
      ... [offers] ...
      19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
      ...
      19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
      19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
      ...
      19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED
      ...
      19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1"
      ...
      19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
      19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
      ...
      19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered'
      ...
      19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
      19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update
      ...
      19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2"
      ...
      19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
      19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''

            People

              Assignee: Sam Tran (stran)
              Reporter: Sam Tran (stran)
              Votes: 0
              Watchers: 2
