SPARK-24942: Improve cluster resource management for jobs containing a barrier stage

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      https://github.com/apache/spark/pull/21758#discussion_r205652317

      We should improve cluster resource management to address the following issues:

      • With dynamic resource allocation enabled, an application may acquire some executors (but not enough to launch all the tasks in a barrier stage), release them when the executor idle timeout expires, and then acquire executors again, possibly repeating this cycle.
      • Two concurrent applications can deadlock. Each application may acquire some resources, but not enough to launch all the tasks in its barrier stage. After hitting the idle timeout and releasing those resources, each may acquire resources again, so the two applications just keep trading resources with each other and neither makes progress.
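
      The two scenarios above can be illustrated with a toy, round-based model. This is only a sketch under simplifying assumptions, not Spark's scheduler or cluster manager: the names `App` and `simulate`, the round-robin hand-out of free slots, and the tick-based idle timeout are all hypothetical and chosen for illustration. A barrier stage is modelled as "all `needed` slots must be held at once, or nothing launches".

```python
from dataclasses import dataclass


@dataclass
class App:
    """Hypothetical application with one barrier stage (illustrative only)."""
    name: str
    needed: int            # tasks in the barrier stage; all must launch together
    held: int = 0          # slots currently held
    idle_ticks: int = 0    # ticks spent holding slots without launching
    launched: bool = False


def simulate(total_slots, demands, idle_timeout, ticks):
    """Run the toy model; return ({app name: launched?}, number of releases)."""
    apps = [App(name, needed) for name, needed in demands.items()]
    free = total_slots
    releases = 0
    for _ in range(ticks):
        # Hand out free slots one at a time, round-robin, to apps still short.
        waiting = [a for a in apps if not a.launched and a.held < a.needed]
        while free > 0 and waiting:
            for a in waiting:
                if free == 0:
                    break
                a.held += 1
                free -= 1
            waiting = [a for a in waiting if a.held < a.needed]
        for a in apps:
            if a.launched:
                continue
            if a.held >= a.needed:
                # Enough slots for every task: the barrier stage can start.
                a.launched = True
            else:
                # Partial allocation is useless for a barrier stage, so the
                # slots sit idle until the timeout forces a release.
                a.idle_ticks += 1
                if a.idle_ticks >= idle_timeout:
                    free += a.held
                    a.held = 0
                    a.idle_ticks = 0
                    releases += 1
    return {a.name: a.launched for a in apps}, releases


if __name__ == "__main__":
    # Two apps, each needing 4 slots, on a 6-slot cluster: neither launches.
    print(simulate(6, {"A": 4, "B": 4}, idle_timeout=2, ticks=20))
    # One app needing 4 slots when only 3 are available: acquire/release churn.
    print(simulate(3, {"A": 4}, idle_timeout=2, ticks=20))
```

      In this model, the 6-slot run keeps splitting the cluster three and three, so both applications repeatedly release and reacquire without ever launching, while an 8-slot cluster lets both launch on the first tick. A real cluster would not be this symmetric, but the sketch shows why partial allocations plus an idle timeout can lead to endless trading.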

          People

            Assignee: Unassigned
            Reporter: Xingbo Jiang (jiangxb1987)
            Votes: 3
            Watchers: 11
