Spark / SPARK-36446

YARN shuffle server restart crashes all dynamic allocation jobs that have deallocated an executor

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.4.8, 3.1.2
    • Fix Version/s: None
    • Component/s: Shuffle
    • Labels: None

    Description

      When dynamic allocation is enabled, executors that are deallocated rely on the shuffle server to hold their blocks and serve them to the remaining executors.

      When the YARN shuffle server restarts (either intentionally or due to a crash), it loses its block information and relies on being able to contact the executors (whose locations it stores durably) to refetch the list of blocks.

      Each side therefore depends on the other to hold block information, and this mutual dependency fails fatally under some common scenarios.

      For example, if a Spark application is running under dynamic allocation, some number of executors will almost always shut down.

      If, after this has occurred, any shuffle server crashes or is restarted (either directly when running as a standalone service, or as part of a YARN node manager restart), there is no way to restore the block data and it is permanently lost.

      Worse, when executors try to fetch blocks from the shuffle server, the shuffle server cannot locate the executor, decides it does not exist, and treats this as a fatal exception, which causes the application to terminate and crash.

      Thus, in a real-world scenario that we observe on a 1000+ node multi-tenant cluster where dynamic allocation is on by default, a rolling restart of the YARN node managers will crash ALL jobs that have deallocated any executor and that either have shuffles or have transferred blocks to the shuffle server in order to shut down.
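
      The failure sequence described above can be illustrated with a toy model. This is only a sketch of the reported behaviour, not Spark's or YARN's actual code: the class ToyShuffleServer, the exception ExecutorNotFound, the function simulate, and the executor and block ids are all hypothetical names invented for illustration. The model assumes, as the description states, that the server keeps executor identities across a restart but keeps block lists only in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


class ExecutorNotFound(Exception):
    """Raised by the toy server when it has no block record for the requested executor."""


@dataclass
class ToyShuffleServer:
    # Executor ids the server remembers across restarts (durable in this model).
    known_executors: Set[str] = field(default_factory=set)
    # executor id -> block ids; held in memory only, so a restart wipes it.
    blocks_by_executor: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, executor_id: str, block_ids: Set[str]) -> None:
        self.known_executors.add(executor_id)
        self.blocks_by_executor[executor_id] = set(block_ids)

    def restart(self, live_executors: Dict[str, Set[str]]) -> None:
        """Lose the in-memory block lists, then re-ask each remembered executor for its list."""
        self.blocks_by_executor = {}
        for executor_id in self.known_executors:
            if executor_id in live_executors:  # deallocated executors cannot answer
                self.blocks_by_executor[executor_id] = set(live_executors[executor_id])

    def fetch(self, executor_id: str, block_id: str) -> str:
        if block_id not in self.blocks_by_executor.get(executor_id, set()):
            raise ExecutorNotFound(f"{executor_id} / {block_id}")
        return block_id


def can_fetch(server: ToyShuffleServer, executor_id: str, block_id: str) -> bool:
    """Return True if the fetch succeeds, False if the server raises ExecutorNotFound."""
    try:
        server.fetch(executor_id, block_id)
        return True
    except ExecutorNotFound:
        return False


def simulate() -> Dict[str, bool]:
    # Two executors register their (hypothetical) blocks with the shuffle server.
    live = {"exec-1": {"block-a"}, "exec-2": {"block-b"}}
    server = ToyShuffleServer()
    for executor_id, block_ids in live.items():
        server.register(executor_id, block_ids)

    # Dynamic allocation removes exec-1; the server still serves its blocks.
    del live["exec-1"]
    dead_before = can_fetch(server, "exec-1", "block-a")

    # The server restarts and can only rebuild its lists from executors still alive.
    server.restart(live)
    return {
        "dead_before_restart": dead_before,
        "dead_after_restart": can_fetch(server, "exec-1", "block-a"),
        "live_after_restart": can_fetch(server, "exec-2", "block-b"),
    }
```

      In this model, simulate() reports that the deallocated executor's block is fetchable before the restart and unrecoverable afterwards, while the block of the executor that is still alive survives. In the real system the failed fetch is what is reported as fatal to the application.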

            People

              Assignee: Unassigned
              Reporter: Adam Kennedy (adamkennedy77)