Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Fix Version: 4.0.0
Description
Shuffle dependencies are created by shuffle map stages; each consists of shuffle files on disk and the corresponding references in Spark JVM heap memory. Currently Spark cleans up unused shuffle dependencies through JVM GC, with periodic GCs triggered once every 30 minutes (see ContextCleaner). However, we still found cases in which the shuffle data files grow too large, which makes shuffle data migration slow.
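As a minimal sketch of the current behavior: the periodic-GC cadence comes from the spark.cleaner.periodicGC.interval setting (30min by default), so today the main knob is shortening that interval rather than cleaning up per query.

```scala
import org.apache.spark.sql.SparkSession

object PeriodicGcInterval {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("periodic-gc-demo")
      .master("local[*]")
      // Shorten ContextCleaner's periodic GC interval (default: 30min) so
      // dead shuffle-dependency references are noticed sooner.
      .config("spark.cleaner.periodicGC.interval", "5min")
      .getOrCreate()

    // ... run shuffle-producing queries here; shuffle files whose
    // dependencies become unreachable are deleted only after a GC runs.

    spark.stop()
  }
}
```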
We have opportunities to clean up shuffle dependencies earlier, especially for SQL queries submitted through Spark Connect, since we have better control over the DataFrame instances there. Even if DataFrame instances are reused on the client side, the instances on the server side are still recreated.
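For illustration, a Spark Connect client can hold a single DataFrame and trigger it repeatedly; each action re-ships the unresolved plan and the server rebuilds its execution state, which is what gives the server room to track when an execution's shuffles are no longer needed. A hedged sketch (the sc://localhost:15002 endpoint is an assumption):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ConnectPlanReuse {
  def main(args: Array[String]): Unit = {
    // Connect to a Spark Connect server; the endpoint is illustrative.
    val spark = SparkSession.builder()
      .remote("sc://localhost:15002")
      .getOrCreate()

    val df = spark.range(1000000L)
      .groupBy((col("id") % 10).as("bucket"))
      .count()

    // The client reuses `df`, but each action sends the plan again and the
    // server recreates the DataFrame/execution instances per request.
    df.collect()
    df.collect()
  }
}
```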
We might also provide options to 1. clean up eagerly after each query execution (see the sketch below), or 2. only mark the shuffle dependencies and skip migrating them at node decommission.
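As a rough sketch of option 1: Spark already exposes a developer API, RDD.cleanShuffleDependencies, that removes the shuffle files behind an RDD's lineage, and an eager-cleanup mode could perform similar bookkeeping once a query execution is released. The example below is an illustration on a plain RDD under that assumption, not the proposed implementation:

```scala
import org.apache.spark.sql.SparkSession

object EagerShuffleCleanup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("eager-shuffle-cleanup")
      .master("local[*]")
      .getOrCreate()

    val counts = spark.sparkContext
      .parallelize(0 until 1000000)
      .map(i => (i % 16, 1))
      .reduceByKey(_ + _) // creates a ShuffleDependency and shuffle files

    counts.collect() // materialize the result; shuffle files now sit on disk

    // Eagerly remove this lineage's shuffle files instead of waiting for
    // ContextCleaner's periodic GC to collect the dead references.
    counts.cleanShuffleDependencies(blocking = true)

    spark.stop()
  }
}
```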