Spark / SPARK-31347

Unable to run Spark Job on Federated Yarn Cluster, AMRMToken invalid

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Deploy, Spark Core, YARN
    • Labels: None

    Description

      I am running Spark on YARN 3.2.1 in a federated cluster.

      The ApplicationMaster fails to register with the ResourceManager and throws an InvalidToken exception.

      root@yarn-master-0:/hadoop/spark# HADOOP_CONF_DIR=/hadoop/federation/router ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 4g \
      --executor-memory 2g \
      --executor-cores 1 \
      --queue default \
      examples/jars/spark-examples*.jar 10
                                                                                                                                                                     
      2020-04-04 16:44:18,144 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable      
      2020-04-04 16:44:18,345 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200                                                                                                                                                                                                                      
      2020-04-04 16:44:18,402 INFO yarn.Client: Requesting a new application from cluster with 10 NodeManagers
      2020-04-04 16:44:18,753 INFO conf.Configuration: resource-types.xml not found                                                                                  
      2020-04-04 16:44:18,754 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.                                                                      
      2020-04-04 16:44:18,766 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (7168 MB per container)
      2020-04-04 16:44:18,767 INFO yarn.Client: Will allocate AM container, with 4505 MB memory including 409 MB overhead                                            
      2020-04-04 16:44:18,767 INFO yarn.Client: Setting up container launch context for our AM                                                                       
      2020-04-04 16:44:18,768 INFO yarn.Client: Setting up the launch environment for our AM container                                                                                                                                                                                                                              
      2020-04-04 16:44:18,776 INFO yarn.Client: Preparing resources for our AM container                                                                             
      2020-04-04 16:44:18,805 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.                                                                                                                                                                        
      2020-04-04 16:44:19,890 INFO yarn.Client: Uploading resource file:/tmp/spark-cfcf1976-612e-4b64-8bf3-5b0c8f1dc6ec/__spark_libs__5444968329971306297.zip -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/__spark_libs__5444968329971306297.zip
      2020-04-04 16:44:22,689 INFO yarn.Client: Uploading resource file:/hadoop/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/spark-examples_2.12-3.0.0-preview2.jar
      2020-04-04 16:44:22,832 INFO yarn.Client: Uploading resource file:/tmp/spark-cfcf1976-612e-4b64-8bf3-5b0c8f1dc6ec/__spark_conf__2558260056925734476.zip -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/__spark_conf__.zip
      2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing view acls to: root                                                                                
      2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing modify acls to: root                                                                                                                                                                                                                                             
      2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing view acls groups to:                                                                              
      2020-04-04 16:44:22,887 INFO spark.SecurityManager: Changing modify acls groups to:                                                                            
      2020-04-04 16:44:22,887 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
      2020-04-04 16:44:22,927 INFO yarn.Client: Submitting application application_1586018216728_0005 to ResourceManager                                             
      2020-04-04 16:44:22,963 INFO impl.YarnClientImpl: Submitted application application_1586018216728_0005                                                         
      2020-04-04 16:44:23,967 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)                                                                                                                                                                                                             
      2020-04-04 16:44:23,969 INFO yarn.Client:                                                                                                                                                                                                                                                                                     
               client token: N/A                                                                                                                                                                                                                                                                                                    
               diagnostics: AM container is launched, waiting for AM container to Register with RM                                                                   
               ApplicationMaster host: N/A                                                                                                                           
               ApplicationMaster RPC port: -1                       
               queue: default                                                                                                                                        
               start time: 1586018662937                                                                                                                             
               final status: UNDEFINED                                                                                                                               
               tracking URL: http://yarn-master-0.yarn-service.yarn-subcluster-a.svc.cluster.local:8088/proxy/application_1586018216728_0005/                                                                                                                                                                                       
               user: root                                                                                                                                            
      2020-04-04 16:44:24,972 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)                                                                                                                                                                                                             
      2020-04-04 16:44:25,974 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)                                                                                                                                                                                                             
      2020-04-04 16:44:26,977 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)  
      2020-04-04 16:44:27,980 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)  
      2020-04-04 16:44:28,983 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)                                                                                                                                                                                                             
      2020-04-04 16:44:29,985 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)                                                                                                                                                                                                             
      2020-04-04 16:44:30,988 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
      2020-04-04 16:44:31,991 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
      2020-04-04 16:44:32,994 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)                                                                                                                                                                                                             
      2020-04-04 16:44:33,996 INFO yarn.Client: Application report for application_1586018216728_0005 (state: FAILED)
      2020-04-04 16:44:33,997 INFO yarn.Client:                                                                                                                      
               client token: N/A                                                                                                                                     
               diagnostics: Application application_1586018216728_0005 failed 2 times due to AM Container for appattempt_1586018216728_0005_000002 exited with  exitCode: 13
      Failing this attempt.Diagnostics: [2020-04-04 16:44:33.276]Exception from container-launch.                        
      Container id: container_e27933_1586018216728_0005_02_000001                                                                                                    
      Exit code: 13 
      
       
      [2020-04-04 16:44:33.297]Container exited with a non-zero exit code 13. Error file: prelaunch.err. 
      Last 4096 bytes of prelaunch.err : 
      Last 4096 bytes of stderr : 
      ect.Constructor.newInstance(Constructor.java:423) 
       at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) 
       at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) 
       at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) 
       at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
       at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
       at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
       at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
       at com.sun.proxy.$Proxy16.registerApplicationMaster(Unknown Source)
       at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:246)
       at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:233)
       at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:213)
       at org.apache.spark.deploy.yarn.YarnRMClient.register(YarnRMClient.scala:71)
       at org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:426)
       at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:504)
       at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:262)
       at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:875)
       at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:874)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
       at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:874)
       at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
      Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1586018216728_0005_000002
       at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
       at org.apache.hadoop.ipc.Client.call(Client.java:1457)
       at org.apache.hadoop.ipc.Client.call(Client.java:1367)
       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
       at com.sun.proxy.$Proxy15.registerApplicationMaster(Unknown Source)
       at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:107)
       ... 24 more
      )
      2020-04-04 16:44:32,555 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005
      2020-04-04 16:44:32,926 INFO storage.DiskBlockManager: Shutdown hook called
      2020-04-04 16:44:32,930 INFO util.ShutdownHookManager: Shutdown hook called
      2020-04-04 16:44:32,930 INFO util.ShutdownHookManager: Deleting directory /opt/hadoop/hadooptmpdata/nm-local-dir/usercache/root/appcache/application_1586018216728_0005/spark-5d3f083f-eb43-49e9-a779-2354e07e9bd7/userFiles-1721c4df-1674-4695-b3aa-02e8c72908c0
      2020-04-04 16:44:32,932 INFO util.ShutdownHookManager: Deleting directory /opt/hadoop/hadooptmpdata/nm-local-dir/usercache/root/appcache/application_1586018216728_0005/spark-5d3f083f-eb43-49e9-a779-2354e07e9bd7
                                                                                                                                                     
      

       

      I am submitting this here rather than in the YARN JIRA because Hadoop MapReduce jobs run normally on the same cluster.
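
      One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: in a federated cluster the ApplicationMaster is expected to register through the NodeManager-local AMRMProxy, not directly with a sub-cluster ResourceManager, and the Hadoop YARN Federation documentation has clients set yarn.resourcemanager.scheduler.address to localhost:8049 for that purpose. The sketch below only prints a variant of the command above that passes this setting through Spark's spark.hadoop.* prefix, so it can be reviewed before running. AMRM_PROXY_ADDR and print_submit_cmd are hypothetical names introduced here, and 8049 is the AMRMProxy default port, which may differ on this cluster.

```shell
# Sketch only, not a confirmed fix for this issue.
# ASSUMPTION: AMRMProxy is enabled on every NodeManager and listens on 8049.
# AMRM_PROXY_ADDR is a hypothetical variable used only by this example.
AMRM_PROXY_ADDR="${AMRM_PROXY_ADDR:-localhost:8049}"

# Print (do not run) the spark-submit command from the description, with the
# scheduler address overridden via a spark.hadoop.* property.
print_submit_cmd() {
  cat <<EOF
HADOOP_CONF_DIR=/hadoop/federation/router ./bin/spark-submit \\
  --class org.apache.spark.examples.SparkPi \\
  --master yarn \\
  --deploy-mode cluster \\
  --driver-memory 4g \\
  --executor-memory 2g \\
  --executor-cores 1 \\
  --queue default \\
  --conf spark.hadoop.yarn.resourcemanager.scheduler.address=${AMRM_PROXY_ADDR} \\
  examples/jars/spark-examples*.jar 10
EOF
}

print_submit_cmd
```

      If the job still fails with the same InvalidToken error after this override, the scheduler address is probably not the cause, and the attached router-yarn-site.xml would be the next place to look.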

       

      Attachments

        1. spark.debug.log (166 kB, Babble Shack)
        2. router-yarn-site.xml (6 kB, Babble Shack)
        3. mapred.log (693 kB, Babble Shack)
        4. spark.log (129 kB, Babble Shack)
        5. spark.out (25 kB, Babble Shack)
        6. mapred.out (15 kB, Babble Shack)

          People

            Assignee: Unassigned
            Reporter: Babble Shack (Babbleshack)