Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-24493

Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: Spark Core, YARN
    • Labels:
      None

      Description

      Kerberos Ticket Renewal is failing on long running spark job. I have added below 2 kerberos properties in the HDFS configuration and ran a spark streaming job (hdfs_wordcount.py)

      dfs.namenode.delegation.token.max-lifetime=1800000 (30min)
      dfs.namenode.delegation.token.renew-interval=900000 (15min)
      

       

      Spark Job failed at 15min with below error:

      18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 (TID 7290, <GatewayNodeHostname>, executor 1): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for abcd: HDFS_DELEGATION_TOKEN owner=abcd@EXAMPLE.COM, renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 18:56:51,276+0000 expected renewal time: 2018-06-04 18:56:13,875+0000
      at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
      at org.apache.hadoop.ipc.Client.call(Client.java:1445)
      at org.apache.hadoop.ipc.Client.call(Client.java:1355)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
      at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
      at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
      at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
      at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
      at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
      at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
      at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
      at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
      at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
      at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
      at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
      at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
      at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
      at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:99)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
      at org.apache.spark.scheduler.Task.run(Task.scala:109)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      

      Steps to Reproduce:

      1. Add dfs.namenode.delegation.token.max-lifetime and dfs.namenode.delegation.token.renew-interval properties in the HDFS config and restart the affected services.
      2. Run spark streaming job on gateway node of the cluster in one terminal tab
        /bin/spark-submit --master yarn --principal <SPN> --keytab <Keytab File Full Path> hdfs_wordcount.py "/tmp/streaming_input" 2>&1 | tee driver.log

      After 15min the spark application is terminated with the above mentioned error.

       

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                asifm37 Asif M
              • Votes:
                2 Vote for this issue
                Watchers:
                9 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: