Hadoop YARN / YARN-5933

ATS stale entries in active directory cause ApplicationNotFoundException in RM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.3
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      On a secure cluster where ATS is down, a submitted Tez job fails while fetching the TIMELINE_DELEGATION_TOKEN, with the exception below:

      0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from alltypesorc group by csmallint;
      INFO  : Session is already open
      INFO  : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
      INFO  : Tez session was closed. Reopening...
      ERROR : Failed to execute tez graph.
      java.lang.RuntimeException: Failed to connect to timeline server. Connection retries limit exceeded. The posted timeline event may be missing
      	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
      	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
      	at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
      	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
      	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
      	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
      	at org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
      	at org.apache.tez.client.TezClient.start(TezClient.java:409)
      	at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
      	at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
      	at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
      	at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
      	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
      	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
      	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
      	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
      	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
      	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
      	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
      	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
      	at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
      	at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
      	at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      

      By this point the Tez YarnClient has already received an application ID from the RM, but the application itself was never submitted. When ATS is restarted, it tries to get the application report for that ID from the RM, and the RM throws ApplicationNotFoundException. ATS keeps repeating the request, which floods the RM.

      RM logs:
      2016-11-23 13:53:57,345 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 5
      2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8050, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 172.26.71.120:37699 Call#26 Retry#0
      org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1479897867169_0005' doesn't exist in RM.
      	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
      	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
      	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
      

      There is a stale application entry inside the /ats/active directory. ATS stops sending requests once we remove the stale application directory.

      [hive@kerberos-2 bin]$ hadoop fs -ls /ats/active
      drwxrwx--- - hive hadoop 0 2016-11-23 13:54 /ats/active/application_1479897867169_0005

      This ATS issue is exposed by Tez jobs because Tez uses the putDomain method. The call chain TimelineClientImpl#putDomain() -> writeDomain() -> getAppAttemptDir() -> createApplicationDir() creates an application directory inside the ATS activePath. After creating this directory, the Tez job fails because it cannot connect to ATS. When ATS comes back, it scans activePath every 60 seconds (yarn.timeline-service.entity-group-fs-store.scan-interval-seconds) and calls GetApplicationReport for the entry, which leads to ApplicationNotFoundException in the RM.
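
The sequence above can be modelled with a small, self-contained sketch. It is a toy simulation on the local filesystem, not the Hadoop code: `FakeResourceManager`, `scanOnce`, and `simulate` are hypothetical names introduced only for illustration. It shows that, as long as the stale directory stays in place, every scan pass produces one more failed lookup against the RM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy model of the active-directory scan; NOT the Hadoop implementation. */
final class StaleEntryScanDemo {

    /** Hypothetical stand-in for the RM: it only knows applications that were actually submitted. */
    static final class FakeResourceManager {
        private final Set<String> submittedApps;
        int failedLookups = 0;

        FakeResourceManager(Set<String> submittedApps) {
            this.submittedApps = submittedApps;
        }

        /** Returns true if the application is known; otherwise counts a failed lookup. */
        boolean hasApplication(String appId) {
            if (submittedApps.contains(appId)) {
                return true;
            }
            failedLookups++;
            return false;
        }
    }

    /** One scan pass: every directory under activePath triggers one RM lookup. */
    static List<String> scanOnce(Path activePath, FakeResourceManager rm) throws IOException {
        List<String> unknown = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(activePath)) {
            for (Path appDir : dirs) {
                String appId = appDir.getFileName().toString();
                if (!rm.hasApplication(appId)) {
                    unknown.add(appId);
                }
            }
        }
        return unknown;
    }

    /** Runs the given number of scan passes and returns how many RM lookups failed. */
    static int simulate(int passes) {
        try {
            Path activePath = Files.createTempDirectory("ats-active");
            // The client created this directory, then failed before submitting the application.
            Files.createDirectory(activePath.resolve("application_1479897867169_0005"));
            // The RM allocated the ID but never saw a submission, so it knows no applications.
            FakeResourceManager rm = new FakeResourceManager(Collections.<String>emptySet());
            for (int i = 0; i < passes; i++) {
                scanOnce(activePath, rm);
            }
            return rm.failedLookups;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Nothing removes the stale entry, so the count grows by one per pass.
        System.out.println("failed lookups after 3 passes: " + simulate(3));
    }
}
```

In this model the number of failed lookups equals the number of scan passes times the number of stale directories, which is the repeated-request pattern visible in the RM logs.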

      For this negative case, ATS can delete the application directory inside activePath from EntityGroupFSTimelineStore#getAppState() once the RM throws ApplicationNotFoundException.
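
A minimal sketch of the proposed cleanup follows, using only the JDK so that it runs standalone. It is not the actual patch: `AppLookup`, `AppNotFoundException`, the two-value `AppState` enum, and `simulate` are hypothetical stand-ins for the RM call, the RM's ApplicationNotFoundException, and the store's state handling, and the real store works against the Hadoop FileSystem API rather than `java.nio.file`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

/** Sketch of "delete the app directory when the RM does not know the app". */
final class StaleAppDirCleanup {

    /** Hypothetical mirror of the RM's ApplicationNotFoundException. */
    static final class AppNotFoundException extends Exception {
        AppNotFoundException(String appId) {
            super("Application with id '" + appId + "' doesn't exist in RM.");
        }
    }

    /** Hypothetical lookup interface standing in for the getApplicationReport call. */
    interface AppLookup {
        void getApplicationReport(String appId) throws AppNotFoundException;
    }

    /** Simplified state for this sketch only. */
    enum AppState { ACTIVE, UNKNOWN }

    /** Resolves the app state; on "not found", removes the directory so later scans skip it. */
    static AppState getAppState(Path appDir, AppLookup rm) {
        String appId = appDir.getFileName().toString();
        try {
            rm.getApplicationReport(appId);
            return AppState.ACTIVE;
        } catch (AppNotFoundException e) {
            deleteRecursively(appDir);
            return AppState.UNKNOWN;
        }
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path dir) {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Scans a temp active directory holding one stale entry; returns the RM lookups made. */
    static int simulate(int passes) {
        int[] lookups = {0};
        // This RM stand-in knows no applications, so every lookup fails.
        AppLookup rm = appId -> {
            lookups[0]++;
            throw new AppNotFoundException(appId);
        };
        try {
            Path activePath = Files.createTempDirectory("ats-active");
            Files.createDirectory(activePath.resolve("application_1479897867169_0005"));
            for (int i = 0; i < passes; i++) {
                // Snapshot the listing first so deletions do not disturb iteration.
                List<Path> appDirs = new ArrayList<>();
                try (DirectoryStream<Path> ds = Files.newDirectoryStream(activePath)) {
                    ds.forEach(appDirs::add);
                }
                for (Path appDir : appDirs) {
                    getAppState(appDir, rm);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lookups[0];
    }

    public static void main(String[] args) {
        // The first pass deletes the stale entry, so later passes make no lookups.
        System.out.println("RM lookups after 3 passes: " + simulate(3));
    }
}
```

With the cleanup in place, the stale entry costs a single failed lookup no matter how many scan passes follow. A real change would also need to decide whether a directory for an application that is still being submitted could be removed too early; that trade-off is outside this sketch.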

            People

            • Assignee: Prabhu Joseph (prabhujoseph)
            • Reporter: Prabhu Joseph (prabhujoseph)