[SPARK-21084] Improvements to dynamic allocation for notebook use cases - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.0, 2.3.0
Fix Version/s: None
Component/s: Block Manager, Scheduler, Spark Core, YARN
Labels:
- bulk-closed

Description

One important application of Spark is to support many notebook users with a single YARN or Spark Standalone cluster. We at IBM have seen this requirement across multiple deployments of Spark: on-premises and private cloud deployments at our clients, as well as on the IBM cloud. The scenario goes something like this: "Every morning at 9am, 500 analysts log into their computers and start running Spark notebooks intermittently for the next 8 hours." I'm sure that many other members of the community are interested in making similar scenarios work.

Dynamic allocation is supposed to support these kinds of use cases by shifting cluster resources towards users who are currently executing scalable code. In our own testing, we have encountered a number of issues with using the current implementation of dynamic allocation for this purpose:
Issue #1: Starvation. A Spark job acquires all available containers, preventing other jobs or applications from starting.
Issue #2: Request latency. Jobs that would normally finish in less than 30 seconds take 2-4x longer when dynamic allocation is enabled.
Issue #3: Unfair resource allocation due to cached data. Applications that have cached RDD partitions hold onto executors indefinitely, denying those resources to other applications.
Issue #4: Loss of cached data leads to thrashing. Applications repeatedly lose partitions of cached RDDs because the underlying executors are removed; the applications then need to rerun expensive computations.
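
For readers who want to see which knobs are involved, here is a minimal sketch of the existing dynamic allocation settings that bear on the issues above. The property names are standard Spark configuration keys; the values and the `to_spark_submit_args` helper are illustrative assumptions, not settings proposed in this JIRA.

```python
# Illustrative values only (assumptions, not recommendations from this issue).
# The keys are standard Spark dynamic allocation properties.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    # The external shuffle service lets executors be removed without
    # losing their shuffle files.
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "0",
    # Issue #1: without a cap, one application can request every container.
    "spark.dynamicAllocation.maxExecutors": "20",
    # Idle executors without cached data are released after this long.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Issue #3: by default executors holding cached data are never released.
    "spark.dynamicAllocation.cachedExecutorIdleTimeout": "1800s",
    # Issue #2: how long tasks must be backlogged before more executors
    # are requested.
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "1s",
}


def to_spark_submit_args(conf):
    """Hypothetical helper: render a config dict as spark-submit --conf flags."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

Note that these settings only trade one problem for another: capping executors and timing out cached executors eases Issues #1 and #3, but releasing an executor that holds cached partitions is exactly what triggers Issue #4.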

This umbrella JIRA covers efforts to address these issues by making enhancements to Spark.
Here's a high-level summary of the currently planned work:

~~SPARK-21097~~: Preserve an executor's cached data when shutting down the executor.
~~SPARK-21122~~: Make Spark give up executors in a controlled fashion when the RM indicates it is running low on capacity.
(JIRA TBD): Reduce the delay for dynamic allocation to spin up new executors.

Note that this overall plan is subject to change, and other members of the community should feel free to suggest changes and to help out.

Issue Links

incorporates:
- SPARK-21097 Dynamic allocation will preserve cached data (Resolved)
- SPARK-21122 Address starvation issues when dynamic allocation is enabled (Resolved)

People

Assignee: Unassigned

Reporter: Frederick Reiss

Votes: 0

Watchers: 11

Dates

Created: 13/Jun/17 22:18

Updated: 08/Oct/19 05:41

Resolved: 08/Oct/19 05:41