Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.2.1
Labels: None
Description
Running on YARN, if an application using the Spark 2.2 external shuffle service has any re-attempts, the shuffle service does not update the credentials properly and the re-attempts fail with javax.security.sasl.SaslException.
A bug fixed in 2.2 (SPARK-21494) changed the ShuffleSecretManager to use containsKey (https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50), which is the proper behavior. The problem is that the key is never removed between application re-attempts. When the second attempt starts, the code sees that the key already exists (since the application id is the same) and does not update the secret.
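For illustration, the problematic pattern looks roughly like the sketch below. This is a simplification, not the actual ShuffleSecretManager source; the class, map name, and main() driver are hypothetical.
{code:java}
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the stale-secret pattern described above.
public class SecretManagerSketch {
  private final ConcurrentHashMap<String, String> shuffleSecretMap =
      new ConcurrentHashMap<>();

  public void registerApp(String appId, String secret) {
    // The containsKey guard avoids clobbering a live secret on duplicate
    // registrations, but application re-attempts reuse the same appId, so
    // the new attempt's secret is silently dropped.
    if (!shuffleSecretMap.containsKey(appId)) {
      shuffleSecretMap.put(appId, secret);
    }
  }

  public String getSecretKey(String appId) {
    return shuffleSecretMap.get(appId);
  }

  public static void main(String[] args) {
    SecretManagerSketch mgr = new SecretManagerSketch();
    mgr.registerApp("application_1234_0001", "secret-from-attempt-1");
    // Attempt 2 registers a fresh secret under the same application id...
    mgr.registerApp("application_1234_0001", "secret-from-attempt-2");
    // ...but the stale secret from attempt 1 is still returned, so the
    // SASL handshake against the shuffle service fails.
    System.out.println(mgr.getSecretKey("application_1234_0001"));
    // prints: secret-from-attempt-1
  }
}
{code}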
To reproduce, run something like a word count with the output directory already existing; see the sketch below. The first attempt fails because the output directory exists; the subsequent attempts fail with the maximum number of executor failures. Note that this assumes the second and third attempts run on the same node as the first attempt.
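A hypothetical reproduction could look like the following. The word-count class, jar, and paths are placeholders for any job that writes to a fixed output directory; spark.authenticate, spark.shuffle.service.enabled, and spark.yarn.maxAppAttempts are the standard settings involved.
{code:bash}
# Pre-create the output directory so the first attempt fails fast.
hdfs dfs -mkdir -p /user/me/wordcount-out

# com.example.WordCount and wordcount.jar are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.authenticate=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.yarn.maxAppAttempts=3 \
  --class com.example.WordCount \
  wordcount.jar \
  /user/me/wordcount-in /user/me/wordcount-out
{code}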