  1. Spark
  2. SPARK-22199

Spark Job on YARN fails with executors "Slave registration failed"


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.6.3
    • Fix Version/s: None
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      A Spark job on YARN failed after reaching the maximum number of executor failures.

      ApplicationMaster logs:

      17/09/28 04:18:27 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Max number of executor failures (3) reached)
      

      The failed container logs show "Slave registration failed: Duplicate executor ID", whereas the driver logs show that the driver had removed those executors because they had been idle for spark.dynamicAllocation.executorIdleTimeout.

      Executor Logs:

      17/09/28 04:18:26 ERROR CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID: 122
      

      Driver logs:

      17/09/28 04:18:21 INFO ExecutorAllocationManager: Removing executor 122 because it has been idle for 60 seconds (new desired total will be 133)
      

      There are two issues here:

      1. The executor's error message "Slave registration failed: Duplicate executor ID" is misleading, because the actual cause is that the executor was removed for being idle.
      2. The job failed because executors had been idle for spark.dynamicAllocation.executorIdleTimeout.
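
To check whether the executor IDs rejected as duplicates are the same ones the driver removed as idle, the two logs can be correlated. The following is a minimal sketch, assuming the log line formats quoted in this report; the function name `correlate` and the regular expressions are illustrative and not part of Spark, and other Spark versions may word these messages differently.

```python
import re

# Patterns derived from the two log lines quoted in this report (assumption:
# the wording is stable for the Spark version being inspected).
REMOVED_IDLE = re.compile(
    r"ExecutorAllocationManager: Removing executor (\d+) because it has been idle"
)
DUPLICATE_ID = re.compile(
    r"Slave registration failed: Duplicate executor ID: (\d+)"
)


def extract_ids(pattern, lines):
    """Collect the executor IDs captured by `pattern` across all lines."""
    ids = set()
    for line in lines:
        match = pattern.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def correlate(driver_lines, executor_lines):
    """Return IDs that were removed as idle AND later rejected as duplicates."""
    removed = extract_ids(REMOVED_IDLE, driver_lines)
    rejected = extract_ids(DUPLICATE_ID, executor_lines)
    return sorted(removed & rejected, key=int)


driver_log = [
    "17/09/28 04:18:21 INFO ExecutorAllocationManager: Removing executor 122 "
    "because it has been idle for 60 seconds (new desired total will be 133)",
]
executor_log = [
    "17/09/28 04:18:26 ERROR CoarseGrainedExecutorBackend: Slave registration "
    "failed: Duplicate executor ID: 122",
]

print(correlate(driver_log, executor_log))
```

A non-empty result only shows that the same ID appears in both logs; it does not by itself establish which event caused the other.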

          People

            Assignee: Unassigned
            Reporter: Prabhu Joseph (prabhujoseph)
            Votes: 0
            Watchers: 4
