SPARK-20079

Re-registration of AM hangs Spark cluster in yarn-client mode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.3.0
    • Component/s: YARN
    • Labels:
      None

      Description

      When the AM re-registers, the ExecutorAllocationManager.reset method is called, which sets the ExecutorAllocationManager.initializing field to true. While this field is true, the driver does not request new executors from the AM. Only the following two events set the field back to false:

      1. An executor has been idle for some time.
      2. A new stage is submitted.
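
The flag behaviour described above can be sketched with a small stand-alone model. This is a simplified illustration, not Spark's ExecutorAllocationManager: the class `AllocationModel` and its method names are hypothetical, and the real manager ramps requests up gradually rather than jumping straight to the maximum.

```scala
// Simplified, hypothetical model of the `initializing` flag.
// This is NOT Spark's ExecutorAllocationManager; all names are illustrative.
final class AllocationModel(maxExecutors: Int) {
  private var initializing = true

  /** Called when the AM re-registers: the flag is set back to true. */
  def reset(): Unit = { initializing = true }

  /** Case 1: an executor has been idle long enough to time out. */
  def onExecutorIdleTimeout(): Unit = { initializing = false }

  /** Case 2: a new stage is submitted. */
  def onStageSubmitted(): Unit = { initializing = false }

  /** Executors the driver would request from the AM (assumed policy). */
  def executorsToRequest(pendingTasks: Int): Int =
    if (initializing) 0 else math.min(pendingTasks, maxExecutors)
}

object AllocationModelDemo {
  def main(args: Array[String]): Unit = {
    val model = new AllocationModel(maxExecutors = 2)
    model.onStageSubmitted()
    println(model.executorsToRequest(pendingTasks = 2000)) // prints 2
    model.reset() // AM re-registration
    println(model.executorsToRequest(pendingTasks = 2000)) // prints 0
  }
}
```

After `reset()`, the model requests nothing until one of the two events fires again, regardless of how many tasks are pending.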

      If the AM is killed and restarted after a stage has been submitted, neither of these events can occur:

      1. When the AM is killed, YARN kills all running containers. All executors are lost, so no executor can become idle.
      2. With no surviving executors, the current stage never completes, so the DAG scheduler never submits a new stage.
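
The resulting hang can be illustrated as a fixed point of a toy state machine. The sketch below is an assumption-laden model, not Spark code: `State`, `amRestart`, and `tick` are hypothetical names, and a single `tick` stands in for one pass of the allocation loop.

```scala
// Hypothetical walk-through of the hang; a toy model, not Spark code.
object AmRestartHang {
  val maxExecutors = 2

  final case class State(initializing: Boolean, liveExecutors: Int)

  /** AM killed: YARN kills all containers; re-registration resets the flag. */
  def amRestart(s: State): State =
    s.copy(initializing = true, liveExecutors = 0)

  /** One pass of the (assumed) allocation loop while a stage is running. */
  def tick(s: State): State = {
    // Both events that clear the flag need at least one live executor:
    // one that can go idle, or one that can finish the current stage.
    val stillInitializing = s.initializing && s.liveExecutors == 0
    // Executors are only requested once the flag has been cleared.
    val live = if (stillInitializing) s.liveExecutors else maxExecutors
    State(stillInitializing, live)
  }

  def main(args: Array[String]): Unit = {
    val afterRestart = amRestart(State(initializing = false, liveExecutors = 2))
    val later = Iterator.iterate(afterRestart)(tick).drop(100).next()
    println(later == afterRestart) // prints true: the state never changes
  }
}
```

In this model, a state with the flag set and zero live executors maps to itself, which matches the circular dependency in the two points above.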

      Reproduction steps:

      1. Start cluster

      echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell  --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2
      

      2. Kill the AM process when a stage is scheduled.

            People

            • Assignee: Marcelo Vanzin (vanzin)
            • Reporter: Guoqiang Li (gq)