SPARK-23361

Driver restart fails if it happens after 7 days from app submission

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.4.0
    • Component/s: YARN
    • Labels: None

      Description

      If you submit an app that is meant to run for more than 7 days (i.e. using --principal / --keytab in cluster mode; see the submission sketch after the stack trace below), and a failure causes the driver to restart after 7 days (the default HDFS delegation token lifetime), the new driver will fail with an error like the following:

      Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (lots of uninteresting token info) can't be found in cache
      	at org.apache.hadoop.ipc.Client.call(Client.java:1472)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1409)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
      	at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
      	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
      	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
      	at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
      	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
      	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
      	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:160)
      

      Note: line numbers may not match the actual Apache code because this is from our internal build.
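
      For context, the failing scenario is a long-running, keytab-based cluster-mode submission. The sketch below shows one way such an app might be launched programmatically through the SparkLauncher API; the jar path, main class, principal and keytab are placeholders, not values from this issue:

      import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

      object SubmitLongRunningApp {
        def main(args: Array[String]): Unit = {
          // Equivalent to: spark-submit --master yarn --deploy-mode cluster \
          //   --principal user@EXAMPLE.COM --keytab /path/to/user.keytab app.jar
          val handle: SparkAppHandle = new SparkLauncher()
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setAppResource("/path/to/long-running-app.jar")  // placeholder jar
            .setMainClass("com.example.LongRunningApp")       // placeholder class
            .setConf("spark.yarn.principal", "user@EXAMPLE.COM")
            .setConf("spark.yarn.keytab", "/path/to/user.keytab")
            .startApplication()

          // Poll until the application reaches a terminal state.
          while (!handle.getState.isFinal) {
            Thread.sleep(5000)
          }
        }
      }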

      This happens because, as part of app submission, the launcher provides delegation tokens for the AM (which is the driver in cluster mode) to use, and by the time the driver restarts those tokens have already expired.
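
      The fix in 2.4.0 reworks how the AM manages delegation tokens. As a rough illustration of the direction (not the actual patch), a restarted AM that has a keytab can log in from it and request brand-new HDFS delegation tokens instead of reusing the ones handed over at submission time. The sketch below assumes only standard Hadoop APIs (UserGroupInformation, FileSystem.addDelegationTokens); names like freshTokens are hypothetical:

      import java.security.PrivilegedExceptionAction

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.FileSystem
      import org.apache.hadoop.security.{Credentials, UserGroupInformation}

      object TokenRefresh {
        /** Log in from the keytab and fetch fresh HDFS delegation tokens. */
        def freshTokens(principal: String, keytabPath: String, conf: Configuration): Credentials = {
          val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabPath)
          ugi.doAs(new PrivilegedExceptionAction[Credentials] {
            override def run(): Credentials = {
              val creds = new Credentials()
              // Ask the NameNode for new tokens; their lifetime starts now,
              // not at the original app submission time.
              FileSystem.get(conf).addDelegationTokens(ugi.getShortUserName, creds)
              creds
            }
          })
        }
      }

      The returned Credentials would then be merged into the current user (UserGroupInformation.getCurrentUser.addCredentials(...)) before the AM touches HDFS.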

              People

              • Assignee: Marcelo Vanzin (vanzin)
              • Reporter: Marcelo Vanzin (vanzin)
              • Votes: 0
              • Watchers: 7
