Spark / SPARK-22218

Spark shuffle service fails to update secret on application re-attempts


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: Shuffle, Spark Core, YARN
    • Labels: None

    Description

      Running on YARN, if an application has any re-attempts while using the Spark 2.2 shuffle service, the external shuffle service does not update the credentials properly, and the application re-attempts fail with javax.security.sasl.SaslException.

      A bug fixed in 2.2 (SPARK-21494) changed the ShuffleSecretManager to use containsKey (https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50), which is the proper behavior. The problem is that the key is never removed between application re-attempts. So when the second attempt starts, the code sees the key as already present (since the application ID is the same) and does not update the secret.
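The failure mode above can be illustrated with a minimal sketch. This is not the actual ShuffleSecretManager source; the class and method names below are hypothetical stand-ins showing how a containsKey guard keeps the first attempt's secret alive across re-attempts that reuse the same application ID, while an unconditional put picks up the new secret:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not the real ShuffleSecretManager) of the re-attempt bug.
public class SecretSketch {
    private final ConcurrentHashMap<String, String> shuffleSecretMap =
        new ConcurrentHashMap<>();

    // Buggy registration: skips the update when the appId is already present,
    // which is exactly the case on an application re-attempt (same appId).
    public void registerAppBuggy(String appId, String secret) {
        if (!shuffleSecretMap.containsKey(appId)) {
            shuffleSecretMap.put(appId, secret);
        }
    }

    // Fixed registration: always overwrite, so a re-attempt's new secret wins.
    public void registerAppFixed(String appId, String secret) {
        shuffleSecretMap.put(appId, secret);
    }

    public String getSecret(String appId) {
        return shuffleSecretMap.get(appId);
    }

    public static void main(String[] args) {
        SecretSketch buggy = new SecretSketch();
        buggy.registerAppBuggy("application_1_0001", "attempt1-secret");
        // Second attempt re-registers under the same appId with a new secret.
        buggy.registerAppBuggy("application_1_0001", "attempt2-secret");
        // Stale secret survives, so SASL auth for attempt 2 fails.
        System.out.println("buggy: " + buggy.getSecret("application_1_0001"));

        SecretSketch fixed = new SecretSketch();
        fixed.registerAppFixed("application_1_0001", "attempt1-secret");
        fixed.registerAppFixed("application_1_0001", "attempt2-secret");
        System.out.println("fixed: " + fixed.getSecret("application_1_0001"));
    }
}
```

With the buggy path the shuffle service still holds "attempt1-secret" when attempt 2 registers, so the executor and shuffle service disagree on the secret and the SASL handshake throws SaslException.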

      To reproduce this, run something like a word count with the output directory already existing. The first attempt will fail because the output directory exists; the subsequent attempts will fail with max number of executor failures. Note that this assumes the second and third attempts run on the same node as the first attempt.

Attachments

Activity

People

Assignee: Thomas Graves (tgraves)
Reporter: Thomas Graves (tgraves)
Votes: 0
Watchers: 3

Dates

Created:
Updated:
Resolved: