When running in cluster mode, tasks of a Spark Kudu client application can fail to obtain new (i.e. non-expired) authentication tokens even if the tasks themselves run for a very short time. Essentially, if the driver runs longer than the authn token expiration interval and has a particular pattern of making RPC calls to Kudu masters and tablet servers, all tasks scheduled to run after the authn token expiration interval are supplied with expired authn tokens, so every such task fails. The only workarounds are restarting the application or dropping the long-established connections from the driver to Kudu masters/tservers.
The details below explain why that can happen.
Let's assume the following holds true for a Spark Kudu application:
- The application is running against a secured Kudu cluster.
- The application is running in cluster mode.
- There are no primary authentication credentials on the machines for the user under which the Spark executors are running (i.e. kinit hasn't been run on those executor machines for the corresponding user, or the Kerberos credentials have already expired there).
- The --authn_token_validity_seconds masters' flag is set to X seconds (default is 60 * 60 * 24 * 7 seconds, i.e. 7 days).
- The --rpc_default_keepalive_time_ms flag for masters (and tablet servers, if they are involved in the communication between the driver process and the Kudu backend) is set to Y milliseconds (default is 65000 ms).
- The application is running for longer than X seconds.
- The driver process makes requests to Kudu masters at least every Y milliseconds.
- The driver either doesn't make requests to Kudu tablet servers or makes such requests at least every Y milliseconds to each of the involved tablet servers.
- The executors run tasks that keep connections to tablet servers idle for longer than Y milliseconds, or the driver spawns a task at an executor more than Y milliseconds after that executor completed its previous task.
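As a rough illustration, the timing and credential conditions above can be written as a predicate. This is only a sketch: the function names and parameters are hypothetical and not part of Kudu or Spark, the "secured cluster" and "cluster mode" conditions are simply assumed to hold, and the keepalive behaviour is reduced to "a connection stays open while requests arrive at least every Y milliseconds".

```python
# Defaults quoted in the list above.
AUTHN_TOKEN_VALIDITY_S = 60 * 60 * 24 * 7   # X: --authn_token_validity_seconds
RPC_KEEPALIVE_MS = 65000                    # Y: --rpc_default_keepalive_time_ms


def connection_stays_open(max_request_gap_ms, keepalive_ms=RPC_KEEPALIVE_MS):
    """Assumed model: a connection survives while requests arrive at least every Y ms."""
    return max_request_gap_ms <= keepalive_ms


def scenario_applies(app_runtime_s, driver_max_gap_ms, executor_idle_ms,
                     executors_have_kerberos_creds,
                     token_validity_s=AUTHN_TOKEN_VALIDITY_S,
                     keepalive_ms=RPC_KEEPALIVE_MS):
    """Hypothetical helper: True when the timing and credential conditions all hold."""
    return (
        not executors_have_kerberos_creds                              # no kinit on executors
        and app_runtime_s > token_validity_s                           # app outlives the token
        and connection_stays_open(driver_max_gap_ms, keepalive_ms)     # driver keeps connections busy
        and not connection_stays_open(executor_idle_ms, keepalive_ms)  # executors must reconnect
    )


if __name__ == "__main__":
    eight_days_s = 8 * 24 * 60 * 60
    # Driver talks to Kudu every 30 s; executor connections sit idle for 2 min.
    print(scenario_applies(eight_days_s, 30_000, 120_000, False))
    # Same timing, but the executors have valid Kerberos credentials.
    print(scenario_applies(eight_days_s, 30_000, 120_000, True))
```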
Essentially, this describes a Spark Kudu application where the driver process keeps its once-opened connections active while the executors need to open new connections to Kudu tablet servers (and/or masters). Also, the executor machines don't have Kerberos credentials for the OS user under which the executor processes run.
In such scenarios, the application's tasks spawned after X seconds from the application start will fail because of expired authentication tokens, while the driver process will never re-acquire its authn token, keeping the expired token in KuduContext forever.
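The failure can be pictured with a toy timeline. The sketch below is a simplified model, not Kudu client code: AuthnToken and task_outcome are hypothetical names, and it assumes that the driver hands its original token to every task and that an executor with valid Kerberos credentials could authenticate on its own.

```python
from dataclasses import dataclass


@dataclass
class AuthnToken:
    issued_at_s: int   # when the driver acquired the token
    validity_s: int    # X, i.e. --authn_token_validity_seconds

    def is_expired(self, now_s):
        return now_s >= self.issued_at_s + self.validity_s


def task_outcome(task_start_s, driver_token, executor_has_kerberos_creds):
    """Outcome of a task that opens a new connection with the driver's token."""
    if not driver_token.is_expired(task_start_s):
        return "ok"
    if executor_has_kerberos_creds:
        # Assumption: primary credentials let the executor authenticate by itself.
        return "ok (authenticated with Kerberos)"
    return "failed: expired authn token"


if __name__ == "__main__":
    x_s = 60 * 60 * 24 * 7
    # The driver never re-acquires its token, so this object never changes.
    token = AuthnToken(issued_at_s=0, validity_s=x_s)
    for start_s in (3600, x_s - 60, x_s + 60, 2 * x_s):
        print(start_s, task_outcome(start_s, token, executor_has_kerberos_creds=False))
```

In this model every task started after X seconds fails, no matter how short the task is, because nothing replaces the token held on the driver side.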