AMBARI-15389

Intermittent YARN service check failures during and post EU

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.2
    • Fix Version/s: 2.2.2
    • Component/s: ambari-server
    • Labels: None

    Description

      Build # - Ambari 2.2.1.1 - #63

      The YARN service check reported failures in a couple of recent Express Upgrade (EU) runs:
      a. In one test, the EU ran from HDP 2.3.4.0 to 2.4.0.0 and the YARN service check failed during the EU itself; retrying the operation made the service check succeed.

      b. In another test, the YARN service check failed when run after the EU; running it a second time succeeded.

      The failure is intermittent, so there appears to be a corner case that triggers it.
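Since a plain retry made the check pass in both runs, one way to reproduce or work around the flakiness by hand is to wrap the check command in a small retry loop. The sketch below is only an illustration and is not the change made in the attached patches. `call_with_retries` is a hypothetical helper written for modern Python 3 with `subprocess`; it borrows the `tries` and `try_sleep` parameter names visible in the traceback but does not use Ambari's `resource_management` library.

```python
import subprocess
import time


def call_with_retries(command, tries=3, try_sleep=5, sleep=time.sleep):
    """Run a shell command, retrying on a non-zero exit code.

    Hypothetical helper for illustration; returns (attempt_number, result)
    for the first successful attempt and raises RuntimeError otherwise.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    last = None
    for attempt in range(1, tries + 1):
        last = subprocess.run(command, shell=True, capture_output=True, text=True)
        if last.returncode == 0:
            return attempt, last
        if attempt < tries:
            # Wait before the next attempt; no sleep after the final failure.
            sleep(try_sleep)
    raise RuntimeError(
        "Command failed after %d tries (exit code %d): %s"
        % (tries, last.returncode, last.stderr.strip())
    )
```

For example, `call_with_retries(check_command, tries=3, try_sleep=10)` would run a service check command string up to three times, where `check_command` stands for whatever command is being exercised. A retry only masks the symptom; it does not explain why the first attempt failed.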

      stderr:   /var/lib/ambari-agent/data/errors-822.txt
      
      Traceback (most recent call last):
        File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/service_check.py", line 142, in <module>
          ServiceCheck().execute()
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
          method(env)
        File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/service_check.py", line 104, in service_check
          user=params.smokeuser,
        File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
          result = function(command, **kwargs)
        File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
          tries=tries, try_sleep=try_sleep)
        File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
          result = _call(command, **kwargs_copy)
        File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
          raise Fail(err_msg)
      resource_management.core.exceptions.Fail: Execution of '/usr/bin/kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari-qa@EXAMPLE.COM; yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar' returned 2. ######## Hortonworks #############
      This is MOTD message, added for testing in qe infra
      16/03/03 02:33:51 INFO impl.TimelineClientImpl: Timeline service address: http://host:8188/ws/v1/timeline/
      16/03/03 02:33:51 INFO distributedshell.Client: Initializing Client
      16/03/03 02:33:51 INFO distributedshell.Client: Running Client
      16/03/03 02:33:51 INFO client.RMProxy: Connecting to ResourceManager at host-9-5.test/127.0.0.254:8050
      16/03/03 02:33:53 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=3
      16/03/03 02:33:53 INFO distributedshell.Client: Got Cluster node info from ASM
      16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host:25454, nodeAddresshost:8042, nodeRackName/default-rack, nodeNumContainers1
      16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host-9-5.test:25454, nodeAddresshost-9-5.test:8042, nodeRackName/default-rack, nodeNumContainers0
      16/03/03 02:33:53 INFO distributedshell.Client: Got node report from ASM for, nodeId=host-9-1.test:25454, nodeAddresshost-9-1.test:8042, nodeRackName/default-rack, nodeNumContainers0
      16/03/03 02:33:53 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.083333336, queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
      16/03/03 02:33:53 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
      16/03/03 02:33:53 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
      16/03/03 02:33:53 INFO distributedshell.Client: Max mem capabililty of resources in this cluster 10240
      16/03/03 02:33:53 INFO distributedshell.Client: Max virtual cores capabililty of resources in this cluster 1
      16/03/03 02:33:53 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
      16/03/03 02:33:53 INFO distributedshell.Client: Set the environment for the application master
      16/03/03 02:33:53 INFO distributedshell.Client: Setting up app master command
      16/03/03 02:33:53 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx10m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_memory 10 --container_vcores 1 --num_containers 1 --priority 0 1><LOG_DIR>/AppMaster.stdout 2><LOG_DIR>/AppMaster.stderr
      16/03/03 02:33:53 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 290 for ambari-qa on 127.0.0.235:8020
      16/03/03 02:33:53 INFO distributedshell.Client: Got dt for hdfs://host-9-1.test:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 127.0.0.235:8020, Ident: (HDFS_DELEGATION_TOKEN token 290 for ambari-qa)
      16/03/03 02:33:53 INFO distributedshell.Client: Submitting application to ASM
      16/03/03 02:33:54 INFO impl.YarnClientImpl: Submitted application application_1456970141888_0011
      16/03/03 02:33:55 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:33:56 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:33:57 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:33:58 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:33:59 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:00 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:01 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:02 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:03 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:04 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:05 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:06 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:07 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:08 INFO distributedshell.Client: Got application report from ASM for, appId=11, clientToAMToken=Token { kind: YARN_CLIENT_TOKEN, service:  }, appDiagnostics=, appMasterHost=host-9-1/127.0.0.235, appQueue=default, appMasterRpcPort=-1, appStartTime=1456972434150, yarnAppState=FINISHED, distributedFinalState=FAILED, appTrackingUrl=http://host-9-5.test:8088/proxy/application_1456970141888_0011/, appUser=ambari-qa
      16/03/03 02:34:08 INFO distributedshell.Client: Application did finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop
      16/03/03 02:34:08 ERROR distributedshell.Client: Application failed to complete successfully
      stdout:   /var/lib/ambari-agent/data/output-822.txt
      
      2016-03-03 02:33:47,974 - Using hadoop conf dir: /usr/hdp/current/hadoop-client/conf
      2016-03-03 02:33:48,013 - Using hadoop conf dir: /usr/hdp/current/hadoop-client/conf
      2016-03-03 02:33:48,018 - checked_call['/usr/bin/kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari-qa@EXAMPLE.COM; yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar'] {'path': '/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin', 'user': 'ambari-qa'}
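The stderr output above repeats the same application report once per second, which makes the actual state changes hard to spot. As a reading aid, the following sketch collapses such a log into its state transitions. It assumes the `yarnAppState=..., distributedFinalState=...` line format seen in this report; `state_transitions` and `sample` are hypothetical names, and the sample lines are abbreviated copies of the log lines above.

```python
import re

# Matches a distributedshell client report line: a timestamp at the start,
# then the two state fields somewhere later on the same line.
REPORT_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"yarnAppState=(?P<yarn>\w+), distributedFinalState=(?P<final>\w+)"
)


def state_transitions(lines):
    """Return (timestamp, yarnAppState, distributedFinalState) tuples,
    keeping only the first line of each run of identical states."""
    transitions = []
    previous = None
    for line in lines:
        match = REPORT_RE.search(line.strip())
        if not match:
            continue
        state = (match.group("yarn"), match.group("final"))
        if state != previous:
            transitions.append((match.group("ts"),) + state)
            previous = state
    return transitions


# Abbreviated lines from the stderr output in this report.
sample = [
    "16/03/03 02:33:55 INFO distributedshell.Client: Got application report from ASM for, appId=11, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED",
    "16/03/03 02:33:56 INFO distributedshell.Client: Got application report from ASM for, appId=11, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED",
    "16/03/03 02:34:04 INFO distributedshell.Client: Got application report from ASM for, appId=11, yarnAppState=RUNNING, distributedFinalState=UNDEFINED",
    "16/03/03 02:34:08 INFO distributedshell.Client: Got application report from ASM for, appId=11, yarnAppState=FINISHED, distributedFinalState=FAILED",
]

for transition in state_transitions(sample):
    print(transition)
```

On this log the application spends about nine seconds in ACCEPTED, about four in RUNNING, and then finishes with a FAILED final status. The client output alone does not say why; the application master logs would be the next place to look.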
      

      Attachments

        1. AMBARI-15389_2.2.patch
          1 kB
          Antonenko Alexander
        2. AMBARI-15389.patch
          1 kB
          Antonenko Alexander
        3. AMBARI-15389.patch
          1 kB
          Dmitry Lysnichenko

            People

              aantonenko Antonenko Alexander
              dmitriusan Dmitry Lysnichenko