SPARK-22876

spark.yarn.am.attemptFailuresValidityInterval does not work correctly

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: YARN
    • Labels: None
    • Environment: hadoop version 2.7.3

    Description

      I assume we can use spark.yarn.maxAppAttempts together with spark.yarn.am.attemptFailuresValidityInterval to keep a long-running application from stopping after an acceptable number of failures.

      But after testing, I found that the application always stops after failing n times, where n is the minimum of spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts from the client's yarn-site.xml.
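
      For reference, my reading of how that limit ends up being computed (a simplified sketch of the logic, not the exact Spark source):

          // Sketch (my paraphrase): the effective limit is the smaller of
          // spark.yarn.maxAppAttempts (if set) and the client-side
          // yarn.resourcemanager.am.max-attempts value.
          def effectiveMaxAppAttempts(sparkMaxAttempts: Option[Int], yarnMaxAttempts: Int): Int =
            sparkMaxAttempts match {
              case Some(n) => math.min(n, yarnMaxAttempts)
              case None    => yarnMaxAttempts
            }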

      For example, the following setup will allow the application master to fail 20 times (a sketch of the Spark-side settings in code follows this list):

      • spark.yarn.am.attemptFailuresValidityInterval=1s
      • spark.yarn.maxAppAttempts=20
      • yarn client: yarn.resourcemanager.am.max-attempts=20
      • yarn resource manager: yarn.resourcemanager.am.max-attempts=3
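
      In code, the two Spark-side keys above would be set like this (a minimal sketch; the same keys can also be passed with --conf to spark-submit, while yarn.resourcemanager.am.max-attempts stays in yarn-site.xml):

          import org.apache.spark.SparkConf

          // Minimal sketch of the Spark-side settings used in the test above.
          val conf = new SparkConf()
            .set("spark.yarn.maxAppAttempts", "20")
            .set("spark.yarn.am.attemptFailuresValidityInterval", "1s")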

      After checking the source code, I found the following in ApplicationMaster.scala (https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293):

      there is a shutdown hook that checks the attempt id against maxAppAttempts; if attempt id >= maxAppAttempts, it tries to unregister the application, and the application finishes.
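
      Paraphrasing that check as a standalone function (my reading of the linked code, not a verbatim copy):

          // When this returns true, the shutdown hook unregisters the AM from YARN,
          // so YARN does not schedule another attempt and the application finishes.
          // spark.yarn.am.attemptFailuresValidityInterval is never consulted here,
          // so failures outside the validity window still count toward the limit.
          def isLastAttempt(attemptId: Int, maxAppAttempts: Int): Boolean =
            attemptId >= maxAppAttempts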

      Is this the expected design, or is it a bug?

    People

    • Assignee: Unassigned
    • Reporter: Jinhan Zhong (jinhan.zhong)
    • Votes: 0
    • Watchers: 4
