FLINK-10368: 'Kerberized YARN on Docker test' unstable

    Description

      Running the 'Kerberized YARN on Docker test' end-to-end test on an AWS instance failed. The problem seems to be that the NameNode went into safe mode due to limited resources.
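
      Safe-mode status and the NameNode's remaining resources can be confirmed with the standard HDFS admin commands. This is only a diagnostic sketch; in this dockerized setup the commands would have to be run inside the Hadoop master container (e.g. via docker exec), which is an assumption about the test environment:

          # Print whether the NameNode is currently in safe mode:
          hdfs dfsadmin -safemode get
          # Show configured/remaining capacity, which is what triggered
          # safe mode here ("Resources are low on NN"):
          hdfs dfsadmin -report

      The client-side log of the failed run: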

      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/home/hadoop-user/flink-1.6.1/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
      2018-09-19 09:04:39,201 INFO  org.apache.hadoop.security.UserGroupInformation               - Login successful for user hadoop-user using keytab file /home/hadoop-user/hadoop-user.keytab
      2018-09-19 09:04:39,453 INFO  org.apache.hadoop.yarn.client.RMProxy                         - Connecting to ResourceManager at master.docker-hadoop-cluster-network/172.22.0.3:8032
      2018-09-19 09:04:39,640 INFO  org.apache.hadoop.yarn.client.AHSProxy                        - Connecting to Application History server at master.docker-hadoop-cluster-network/172.22.0.3:10200
      2018-09-19 09:04:39,656 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
      2018-09-19 09:04:39,656 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
      2018-09-19 09:04:39,901 INFO  org.apache.flink.yarn.AbstractYarnClusterDescriptor           - Cluster specification: ClusterSpecification{masterMemoryMB=2000, taskManagerMemoryMB=2000, numberTaskManagers=3, slotsPerTaskManager=1}
      2018-09-19 09:04:40,286 WARN  org.apache.flink.yarn.AbstractYarnClusterDescriptor           - The configuration directory ('/home/hadoop-user/flink-1.6.1/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
      
      ------------------------------------------------------------
       The program finished with the following exception:
      
      org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:420)
              at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:259)
              at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
              at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
              at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
              at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
              at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
      Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file/user/hadoop-user/.flink/application_1537266361291_0099/lib/slf4j-log4j12-1.7.7.jar. Name node is in safe mode.
      Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:master.docker-hadoop-cluster-network
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1407)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1395)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2278)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2223)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
      
              at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
              at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
              at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
              at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
              at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:270)
              at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1274)
              at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1216)
              at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:473)
              at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:470)
              at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
              at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:470)
              at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:411)
              at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
              at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
              at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
              at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:368)
              at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
              at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2002)
              at org.apache.flink.yarn.Utils.setupLocalResource(Utils.java:162)
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor.setupSingleLocalResource(AbstractYarnClusterDescriptor.java:1139)
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor.access$000(AbstractYarnClusterDescriptor.java:111)
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor$1.visitFile(AbstractYarnClusterDescriptor.java:1200)
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor$1.visitFile(AbstractYarnClusterDescriptor.java:1188)
              at java.nio.file.Files.walkFileTree(Files.java:2670)
              at java.nio.file.Files.walkFileTree(Files.java:2742)
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor.uploadAndRegisterFiles(AbstractYarnClusterDescriptor.java:1188)
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:800)
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:542)
              at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:413)
              ... 9 more
      Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/user/hadoop-user/.flink/application_1537266361291_0099/lib/slf4j-log4j12-1.7.7.jar. Name node is in safe mode.
      Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:master.docker-hadoop-cluster-network
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1407)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1395)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2278)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2223)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
      
              at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
              at org.apache.hadoop.ipc.Client.call(Client.java:1435)
              at org.apache.hadoop.ipc.Client.call(Client.java:1345)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
              at com.sun.proxy.$Proxy14.create(Unknown Source)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
              at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
              at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
              at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
              at com.sun.proxy.$Proxy15.create(Unknown Source)
              at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:265)
              ... 33 more
      Running the Flink job failed, might be that the cluster is not ready yet. We have been trying for 795 seconds, retrying ...
      

      I think it would be good to harden the test against this failure mode, for example by making sure the NameNode is out of safe mode before the job is submitted; see the sketch below.
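
      A minimal sketch of such a check, assuming the end-to-end script can run commands in the Hadoop master container (the container name 'master', the use of docker exec, and the availability of the timeout utility inside the container are all assumptions, not part of this report):

          # Before submitting the Flink job, wait (bounded) for the NameNode to
          # leave safe mode; 'hdfs dfsadmin -safemode wait' blocks until it is off.
          if ! docker exec master timeout 120 hdfs dfsadmin -safemode wait; then
              # A NameNode that entered safe mode because resources are low will
              # not leave it on its own; per the error message above, space has to
              # be freed first and safe mode then turned off manually.
              docker exec master hdfs dfsadmin -safemode leave
          fi

      Alternatively, the test could fail fast with a clearer message when 'hdfs dfsadmin -safemode get' reports that safe mode is on, instead of retrying for 795 seconds.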


          People

            Assignee: Aljoscha Krettek (aljoscha)
            Reporter: Till Rohrmann (trohrmann)
            Votes: 0
            Watchers: 5


              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 20m