Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20079

Re registration of AM hangs spark cluster in yarn-client mode

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.1.0
    • 2.3.0
    • Spark Core, YARN
    • None

    Description

      The ExecutorAllocationManager.reset method is called when re-registering AM, which sets the ExecutorAllocationManager.initializing field true. When this field is true, the Driver does not start a new executor from the AM request. The following two cases will cause the field to False

      1. A executor idle for some time.
      2. There are new stages to be submitted

      After the a stage was submitted, the AM was killed and restart ,the above two cases will not appear.

      1. When AM is killed, the yarn will kill all running containers. All execuotr will be lost and no executor will be idle.
      2. No surviving executor, resulting in the current stage will never be completed, DAG will not submit a new stage.

      Reproduction steps:

      1. Start cluster

      echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell  --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2
      

      2. Kill the AM process when a stage is scheduled.

      Attachments

        Activity

          People

            vanzin Marcelo Masiero Vanzin
            gq Guoqiang Li
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: