Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Fix Version: 4.0.0
Description
Shuffle dependencies are created by shuffle map stages; each consists of shuffle files on disk and the corresponding references in Spark JVM heap memory. Currently Spark cleans up unused shuffle dependencies through JVM GC, with periodic GCs triggered once every 30 minutes (see ContextCleaner). However, we still found cases in which the shuffle data files grow too large, which makes shuffle data migration slow.
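As a minimal sketch of the current behavior: the periodic-GC cadence comes from the spark.cleaner.periodicGC.interval setting (30min by default), so today the main knob is shortening that interval rather than cleaning up per query.

```scala
import org.apache.spark.sql.SparkSession

object PeriodicGcInterval {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("periodic-gc-demo")
      .master("local[*]")
      // Shorten ContextCleaner's periodic GC interval (default: 30min) so
      // dead shuffle-dependency references are noticed sooner.
      .config("spark.cleaner.periodicGC.interval", "5min")
      .getOrCreate()

    // ... run shuffle-producing queries here; shuffle files whose
    // dependencies become unreachable are deleted only after a GC runs.

    spark.stop()
  }
}
```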
We have opportunities to clean up shuffle dependencies earlier, especially for SQL queries submitted through Spark Connect, since we have better control over the DataFrame instances there. Even if DataFrame instances are reused on the client side, the instances on the server side are still recreated.
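For illustration, a Spark Connect client can hold a single DataFrame and trigger it repeatedly; each action re-ships the unresolved plan and the server rebuilds its execution state, which is what gives the server room to track when an execution's shuffles are no longer needed. A hedged sketch (the sc://localhost:15002 endpoint is an assumption):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ConnectPlanReuse {
  def main(args: Array[String]): Unit = {
    // Connect to a Spark Connect server; the endpoint is illustrative.
    val spark = SparkSession.builder()
      .remote("sc://localhost:15002")
      .getOrCreate()

    val df = spark.range(1000000L)
      .groupBy((col("id") % 10).as("bucket"))
      .count()

    // The client reuses `df`, but each action sends the plan again and the
    // server recreates the DataFrame/execution instances per request.
    df.collect()
    df.collect()
  }
}
```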
We might also provide options to 1. clean up eagerly after each query execution (see the sketch below), or 2. only mark the shuffle dependencies and skip migrating them at node decommission.
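As a rough sketch of option 1: Spark already exposes a developer API, RDD.cleanShuffleDependencies, that removes the shuffle files behind an RDD's lineage, and an eager-cleanup mode could perform similar bookkeeping once a query execution is released. The example below is an illustration on a plain RDD under that assumption, not the proposed implementation:

```scala
import org.apache.spark.sql.SparkSession

object EagerShuffleCleanup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("eager-shuffle-cleanup")
      .master("local[*]")
      .getOrCreate()

    val counts = spark.sparkContext
      .parallelize(0 until 1000000)
      .map(i => (i % 16, 1))
      .reduceByKey(_ + _) // creates a ShuffleDependency and shuffle files

    counts.collect() // materialize the result; shuffle files now sit on disk

    // Eagerly remove this lineage's shuffle files instead of waiting for
    // ContextCleaner's periodic GC to collect the dead references.
    counts.cleanShuffleDependencies(blocking = true)

    spark.stop()
  }
}
```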