Spark / SPARK-22876

spark.yarn.am.attemptFailuresValidityInterval does not work correctly


Details

    • Type: Bug
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core, YARN
    • Environment: Hadoop version 2.7.3

    Description

      I assume we can use spark.yarn.maxAppAttempts together with spark.yarn.am.attemptFailuresValidityInterval to keep a long-running application from stopping after an acceptable number of failures.

      But after testing, I found that the application always stops after failing n times, where n is the minimum of spark.yarn.maxAppAttempts and the yarn.resourcemanager.am.max-attempts value from the client's yarn-site.xml.

      For example, with the following setup the application master is still only allowed to fail 20 times, even though the 1 second validity interval should keep earlier failures from counting (a launcher sketch follows the list):

      • spark.yarn.am.attemptFailuresValidityInterval=1s
      • spark.yarn.maxAppAttempts=20
      • client yarn-site.xml: yarn.resourcemanager.am.max-attempts=20
      • ResourceManager yarn-site.xml: yarn.resourcemanager.am.max-attempts=3
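
      A minimal sketch of how a job might be submitted with these settings, assuming a hypothetical jar path and main class (SubmitLongRunningApp, /path/to/long-running-app.jar are placeholders); the two yarn.resourcemanager.am.max-attempts values come from the client and ResourceManager yarn-site.xml files and are not set through the launcher:

        import org.apache.spark.launcher.SparkLauncher

        // Hypothetical submission of a long-running job in yarn cluster mode; only the
        // attempt-related settings matter here, everything else is a placeholder.
        object SubmitLongRunningApp {
          def main(args: Array[String]): Unit = {
            val spark = new SparkLauncher()
              .setMaster("yarn")
              .setDeployMode("cluster")
              .setAppResource("/path/to/long-running-app.jar") // placeholder
              .setMainClass("com.example.LongRunningApp")      // placeholder
              // failures older than 1s should no longer count against the attempt limit
              .setConf("spark.yarn.am.attemptFailuresValidityInterval", "1s")
              // ask YARN for up to 20 application master attempts
              .setConf("spark.yarn.maxAppAttempts", "20")
              .launch()
            spark.waitFor()
          }
        }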

      After checking the source code, I found that ApplicationMaster.scala (https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293) installs a shutdown hook that checks the attempt id against maxAppAttempts: if the attempt id is >= maxAppAttempts, it unregisters the application and the application finishes. A self-contained sketch of that decision is below.
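
      To illustrate the problem, here is a minimal, self-contained model of that decision (AttemptCheckSketch and shouldUnregister are hypothetical names used only for this sketch, not the real code, but the comparison mirrors the linked check):

        object AttemptCheckSketch {
          /** True when the AM would unregister from the ResourceManager and end the application. */
          def shouldUnregister(attemptId: Int, maxAppAttempts: Int, succeeded: Boolean): Boolean =
            succeeded || attemptId >= maxAppAttempts

          def main(args: Array[String]): Unit = {
            // With maxAppAttempts = 20 the 20th failed attempt ends the application,
            // even if spark.yarn.am.attemptFailuresValidityInterval has already expired
            // all earlier failures; the interval is never consulted in this check.
            println(shouldUnregister(attemptId = 20, maxAppAttempts = 20, succeeded = false)) // true
          }
        }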

      Is this expected design or a bug?



          People

            Assignee: Unassigned
            Reporter: Jinhan Zhong (jinhan.zhong)
            Votes: 0
            Watchers: 9
