[SPARK-22958] Spark is stuck when the only one executor fails to register with driver - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: Spark Core, YARN
Labels:
- bulk-closed

Description

We have encountered the following scenario. We run a very simple job in yarn cluster mode. This job needs only one executor to complete. In the running, this job was stuck forever.

After checking the job log, we found an issue in the Spark. When executor fails to register with driver, YarnAllocator is blind to know it. As a result, the variable (numExecutorsRunning) maintained by YarnAllocator does not reflect the truth. When this variable is used to allocate resources to the running job, misunderstanding happens. As for our job, the misunderstanding results in forever stuck.

The more details are as follows. The following figure shows how executor is allocated when the job starts to run. Now suppose only one executor is needed. In the figure, step 1,2,3 show how the executor is allocated. After the executor is allocated, it needs to register with the driver (step 4) and the driver responses to it (step 5). After the 5 steps, the executor can be used to run tasks.

In YarnAllocator, when step 3 is finished, it will increase the the variable "numExecutorsRunning" by one as shown in the following code.

def updateInternalState(): Unit = synchronized {
        // increase the numExecutorsRunning 
        numExecutorsRunning += 1
        executorIdToContainer(executorId) = container
        containerIdToExecutorId(container.getId) = executorId

        val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
          new HashSet[ContainerId])
        containerSet += containerId
        allocatedContainerToHostMap.put(containerId, executorHostname)
      }

      if (numExecutorsRunning < targetNumExecutors) {
        if (launchContainers) {
          launcherPool.execute(new Runnable {
            override def run(): Unit = {
              try {
                new ExecutorRunnable(
                  Some(container),
                  conf,
                  sparkConf,
                  driverUrl,
                  executorId,
                  executorHostname,
                  executorMemory,
                  executorCores,
                  appAttemptId.getApplicationId.toString,
                  securityMgr,
                  localResources
                ).run()
                // step 3 is finished
                updateInternalState()
              } catch {
                case NonFatal(e) =>
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
                  // Assigned container should be released immediately to avoid unnecessary resource
                  // occupation.
                  amClient.releaseAssignedContainer(containerId)
              }
            }
          })
        } else {
          // For test only
          updateInternalState()
        }
      } else {
        logInfo(("Skip launching executorRunnable as runnning Excecutors count: %d " +
          "reached target Executors count: %d.").format(numExecutorsRunning, targetNumExecutors))
      }

Imagine the step 3 successes, but the step 4 is failed due to some reason (for example network fluctuation). The variable "numExecutorsRunning" is equal to 1. But, the fact is no executor is running. So, The variable "numExecutorsRunning" does not reflect the real number of running executors. For YarnAllocator, because the variable is equal to 1, it does not allocate any new executor even though no executor is actually running. If one job only needs one executor to complete, it will stuck forever since no executor runs its tasks.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

How new executor is registered.png
04/Jan/18 12:11
21 kB
Shaoquan Zhang

Activity

People

Assignee:: Unassigned

Reporter:: Shaoquan Zhang

Votes:: 1 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 04/Jan/18 12:10

Updated:: 17/May/20 18:13

Resolved:: 21/May/19 04:12