Spark / SPARK-22218

spark shuffle services fails to update secret on application re-attempts

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: Shuffle, YARN
    • Labels:
      None

      Description

      When running on YARN, any application re-attempt using the Spark 2.2 shuffle service fails with javax.security.sasl.SaslException, because the external shuffle service does not update the credentials for the new attempt.

      A bug fixed in 2.2 (SPARK-21494) changed ShuffleSecretManager to use containsKey (https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50), which is the proper behavior. The problem is that the key is never removed between application re-attempts. When the second attempt starts, the code sees that the key is already present (the application id is the same across attempts) and never updates the secret.
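      The failure mode can be sketched as follows. This is a minimal, hypothetical reduction of the ShuffleSecretManager registration path, not the actual class: the map, the containsKey guard, and the always-overwrite variant illustrate why a re-attempt with the same application id keeps the stale secret, and why unconditionally storing the secret avoids that.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the secret-registration logic; names loosely follow
// ShuffleSecretManager but this is not the real implementation.
class SecretManagerSketch {
    private final ConcurrentHashMap<String, String> shuffleSecretMap =
        new ConcurrentHashMap<>();

    // Buggy path: skip registration when the appId is already present.
    // A second attempt reuses the same appId, so its new secret is dropped
    // and SASL authentication against the shuffle service fails.
    void registerAppBuggy(String appId, String secret) {
        if (!shuffleSecretMap.containsKey(appId)) {
            shuffleSecretMap.put(appId, secret);
        }
    }

    // Fixed path: always overwrite, so each attempt's secret takes effect.
    void registerAppFixed(String appId, String secret) {
        shuffleSecretMap.put(appId, secret);
    }

    String getSecretKey(String appId) {
        return shuffleSecretMap.get(appId);
    }
}
```

      With the buggy path, registering `application_1234_0001` twice (attempt 1, then attempt 2) leaves attempt 1's secret in place; the fixed path stores attempt 2's secret.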

      To reproduce this, run something like a word count with the output directory already existing. The first attempt fails because the output directory exists; the subsequent attempts fail with the max number of executor failures. Note that this assumes the second and third attempts run on the same node as the first attempt.


            People

            • Assignee: tgraves Thomas Graves
            • Reporter: tgraves Thomas Graves
            • Votes: 0
            • Watchers: 4
