[SPARK-37618] Support cleaning up shuffle blocks from external shuffle service - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.2.0
Fix Version/s: 3.3.0
Component/s: Shuffle, Spark Core
Labels:
- pull-request-available

Description

Currently, shuffle data is not cleaned up when an external shuffle service is used and the executor that wrote it has been deallocated before the shuffle is cleaned up. In that case, the shuffle data is removed only when the application ends.

There have been various issues filed for this:

https://issues.apache.org/jira/browse/SPARK-26020

https://issues.apache.org/jira/browse/SPARK-17233

https://issues.apache.org/jira/browse/SPARK-4236

Despite these, shuffle files still stick around until the application completes. Dynamic allocation is commonly used for long-running jobs (such as Structured Streaming), so any long-running job with a large shuffle will eventually fill up local disk space. The shuffle service already supports cleaning up RDDs persisted through the shuffle service, so it should be able to clean up shuffle blocks as well once the ContextCleaner removes the shuffle.

The current alternative is to use shuffle tracking instead of an external shuffle service, but this is less efficient from a resource perspective: every executor holding shuffle data must be kept alive until the shuffle has been fully consumed and cleaned up. With the default GC interval of 30 minutes, executors can be held for a long time without doing any work.
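The two setups contrasted above can be sketched as configuration. This is a minimal illustration, not the change made in the pull request: it assumes that `spark.shuffle.service.removeShuffle` (the property named in the linked SPARK-47448) is the switch that turns on shuffle block cleanup in the external shuffle service, and that the 30-minute GC interval mentioned above corresponds to `spark.cleaner.periodicGC.interval`. The helper names `shuffle_cleanup_conf` and `to_submit_args` are hypothetical.

```python
from typing import Dict, List


def shuffle_cleanup_conf(use_external_shuffle_service: bool = True) -> Dict[str, str]:
    """Build Spark properties for one of the two setups described in this issue."""
    # Both setups use dynamic allocation, which is what releases idle executors.
    conf = {"spark.dynamicAllocation.enabled": "true"}
    if use_external_shuffle_service:
        # Shuffle files are served by the external shuffle service and outlive executors.
        conf["spark.shuffle.service.enabled"] = "true"
        # Assumption: this property asks the service to delete shuffle blocks
        # once the ContextCleaner removes the shuffle.
        conf["spark.shuffle.service.removeShuffle"] = "true"
    else:
        # Alternative: keep executors alive while their shuffle data may still be needed.
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
        # Assumption: this is the GC interval referred to above (30 minutes by default).
        conf["spark.cleaner.periodicGC.interval"] = "30min"
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the properties as --conf arguments for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(shuffle_cleanup_conf(True))))
```

Passing `False` produces the shuffle-tracking variant, which avoids the external service at the cost of holding executors until their shuffle data has been cleaned up.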

Issue Links

is related to

SPARK-47448 Enable spark.shuffle.service.removeShuffle by default

Resolved

relates to

SPARK-38005 Support cleaning up merged shuffle files and state from external shuffle service

Resolved

links to

[Github] Pull Request #35085 (Kimahriman)

[Github] Pull Request #36473 (mridulm)

People

Assignee: Adam Binford
Reporter: Adam Binford
Votes: 0
Watchers: 3

Dates

Created: 12/Dec/21 15:40
Updated: 30/Aug/24 11:20
Resolved: 25/Mar/22 18:04