Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-10294

NodeManager shows a wrong reason when a YARN service fails to start

Add voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels:
      None

      Description

      We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 gets assigned and tries to start the service. We can see from its log:

      2020-05-28 14:48:18,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_6_0001_01_000001]
      2020-05-28 14:48:18,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from SCHEDULED to RUNNING

      Due to some misconfiguration, the container fails to start. We can also see from the container's serviceam.log:

      2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up
      org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)                    
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
      2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry
      org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir()  on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
        at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440)
        at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595)
        at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99)
        at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194)
        at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210)
        at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)     
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)                    
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)                                       
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)       
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)        
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)                  
        at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
        at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
        at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)              
        at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)                                                                                                                                                                
        at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
        at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
        at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
        at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
        at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:587)
        ... 11 more
      2020-05-28 14:49:12,351 [Listener at 0.0.0.0/43147] INFO  service.AbstractService - Service sleeper1 failed in state STARTED
      java.lang.NullPointerException                                                   
        at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120)
        at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463)
        at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)                                                                                                                                                                                        
        at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122)
        at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267)
        at java.security.AccessController.doPrivileged(Native Method)                  
        at javax.security.auth.Subject.doAs(Subject.java:422)                          
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)   
        at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346)
      
      ...
      
      2020-05-28 14:49:13,463 [Listener at 0.0.0.0/43147] ERROR service.ServiceMaster - Error starting service master
      java.lang.NullPointerException                                                   
        at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120)
        at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463)
        at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)   
        at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122)
        at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267)
        at java.security.AccessController.doPrivileged(Native Method)                  
        at javax.security.auth.Subject.doAs(Subject.java:422)                          
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)   
        at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346)   
      2020-05-28 14:49:13,463 [IPC Server Responder] INFO  ipc.Server - Stopping IPC Server Responder
      2020-05-28 14:49:13,465 [Listener at 0.0.0.0/43147] INFO  util.ExitUtil - Exiting with status 1: Error starting service master

       

      However, NM1 shows an incorrect reason for why the container fails to start:

      2020-05-28 14:49:13,516 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_6_0001_01_000001 and exit code: 1
      ExitCodeException exitCode=1: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
      /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)        
      /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)        
      
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)                    
        at org.apache.hadoop.util.Shell.run(Shell.java:901)                            
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)  
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:312)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:567)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:355)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)                    
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)                                       
      2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
      2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_6_0001_01_000001
      2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
      2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception message: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
      2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
      2020-05-28 14:49:13,517 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
      2020-05-28 14:49:13,519 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1. 
      2020-05-28 14:49:13,580 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE

      The above error shown by NM1 is actually a warning from the container's prelaunch.err file, which differs from the error as shown in the container's serviceam.log.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              ycozy YCozy

              Dates

              • Created:
                Updated:

                Issue deployment