Hadoop YARN / YARN-8580

yarn.resourcemanager.am.max-attempts is not respected for yarn services

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: yarn-native-services
    • Labels: None

    Description

      1) The maximum number of AM attempts is set to 100 on all nodes (including the gateway).

       <property>
         <name>yarn.resourcemanager.am.max-attempts</name>
         <value>100</value>
       </property>

      2) Start a YARN service application (HBase tarball).
      3) Kill the AM 20 times.

      The application fails after the 20th attempt, with the diagnostics below.

      bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status application_1532481557746_0001
      18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History server at xxx/xxx:10200
      18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
      18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml
      Application Report : 
      	Application-Id : application_1532481557746_0001
      	Application-Name : hbase-tarball-lr
      	Application-Type : yarn-service
      	User : hbase
      	Queue : default
      	Application Priority : 0
      	Start-Time : 1532481864863
      	Finish-Time : 1532522943103
      	Progress : 100%
      	State : FAILED
      	Final-State : FAILED
      	Tracking-URL : https://xxx:8090/cluster/app/application_1532481557746_0001
      	RPC Port : -1
      	AM Host : N/A
      	Aggregate Resource Allocation : 252150112 MB-seconds, 164141 vcore-seconds
      	Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
      	Log Aggregation Status : SUCCEEDED
      	Diagnostics : Application application_1532481557746_0001 failed 20 times (global limit =100; local limit is =20) due to AM Container for appattempt_1532481557746_0001_000020 exited with  exitCode: 137
      Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed on request. Exit code is 137
      [2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137. 
      [2018-07-25 12:49:03.045]Killed by external signal
      For more detailed output, check the application tracking page: https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on links to logs of each attempt.
      . Failing the application.
      	Unmanaged Application : false
      	Application Node Label Expression : <Not set>
      	AM container Node Label Expression : <DEFAULT_PARTITION>
      	TimeoutType : LIFETIME	ExpiryTime : 2018-07-25T22:26:15.419+0000	RemainingTime : 0seconds
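
The diagnostics line above reports two limits ("global limit =100; local limit is =20"), and the application fails after 20 attempts. The following is a minimal sketch of how such a pair of limits could combine, assuming the per-application ("local") value applies whenever it is set and does not exceed the global one. The function name `effective_max_attempts` is hypothetical and is not part of any YARN API; the rule itself is an assumption inferred from the output shown here.

```python
def effective_max_attempts(global_limit: int, local_limit: int) -> int:
    """Return the AM attempt limit that applies to one application.

    Assumed rule (not taken from the YARN source): an unset (<= 0) or
    too-large local value falls back to the global limit; otherwise the
    local value is used.
    """
    if local_limit <= 0 or local_limit > global_limit:
        return global_limit
    return local_limit


# Values from the diagnostics in this report: global limit 100, local limit 20.
print(effective_max_attempts(100, 20))  # prints 20
```

Under this assumption, raising only the global value to 100 would not change the outcome for an application whose local limit is 20.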
      

          People

            Assignee: Unassigned
            Reporter: Yesha Vora (yeshavora)