Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Not A Problem
- Affects Version/s: None
- Fix Version/s: None
Description
We recently learnt the hard way (in a prod system) that Spark by default does not delete its temporary files until the application is stopped. Within a relatively short span of heavy Spark use, the disk of our prod machine filled up completely with accumulated shuffle files. We think the documentation should clearly state that a finished job leaves its temporary files behind, so that this does not come as a surprise.
A good place to highlight this would be the documentation of the spark.local.dir property, which controls where Spark writes its temporary files.
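As a sketch of the mitigation discussed here, spark.local.dir can point shuffle and spill files at a dedicated scratch volume instead of the default /tmp, and periodic context cleaning can be enabled. The path below is illustrative, and note that files are only fully removed when the SparkContext is stopped:

```
# spark-defaults.conf (illustrative values)

# Write shuffle/spill files to a dedicated large volume instead of /tmp
spark.local.dir                    /mnt/spark-scratch

# Trigger the ContextCleaner's periodic GC so dereferenced shuffle
# data can be removed while the application is still running
spark.cleaner.periodicGC.interval  30min
```

In cluster deployments this property may be overridden by the cluster manager (e.g. SPARK_LOCAL_DIRS on standalone, or YARN's local directories), which is another detail worth surfacing in the docs.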
Issue Links
- is duplicated by:
  - SPARK-3563 Shuffle data not always be cleaned (Resolved)
  - SPARK-4796 Spark does not remove temp files (Resolved)
  - SPARK-6011 Out of disk space due to Spark not deleting shuffle files of lost executors (Resolved)
- is related to:
  - SPARK-31208 Expose the ability for user to cleanup shuffle files (Resolved)