Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: None
Description
If I call rdd.saveAsTextFile with an existing directory, it will cheerfully and silently overwrite the files in there. This is bad enough if it means I've accidentally blown away the results of a job that might have taken minutes or hours to run. But it's worse if the second job happens to have fewer partitions than the first: in that case, my output directory now contains some "part" files from the earlier job and some "part" files from the later job, and the only way to tell them apart is by timestamp.
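For illustration, here is a minimal sketch of the mixed-output scenario (the output path and data are made up, and this reproduces the behavior as reported here against 0.9.0):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object OverwriteRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("overwrite-repro").setMaster("local"))
    val out = "/tmp/repro-output" // hypothetical path

    // First job writes 4 partitions: part-00000 through part-00003.
    sc.parallelize(1 to 100, 4).saveAsTextFile(out)

    // Second job writes only 2 partitions: part-00000 and part-00001 are
    // silently replaced, while part-00002 and part-00003 from the first
    // job are left behind, mixing the two jobs' output.
    sc.parallelize(1 to 10, 2).saveAsTextFile(out)

    sc.stop()
  }
}
{code}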
I wonder if Spark's saveAsTextFile shouldn't work more like Hadoop MapReduce, which insists that the output directory not exist before the job starts. Similarly, HDFS won't overwrite files by default. Perhaps there could be an optional argument for saveAsTextFile that indicates whether it should delete the existing directory before starting. (I can't see any case where I'd want to write into a directory that already has data in it. Would a mix of output from different jobs ever be desirable?)
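As a sketch of what that could look like, here is a hypothetical wrapper (saveAsTextFileSafely is not an actual Spark API) that refuses to write over an existing directory unless an explicit overwrite flag is passed, using the Hadoop FileSystem API for the existence check:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

object SafeSave {
  // Hypothetical helper mirroring Hadoop MapReduce's rule that the output
  // directory must not already exist. Passing overwrite = true opts in to
  // deleting it first, per the optional argument suggested above.
  def saveAsTextFileSafely[T](rdd: RDD[T], dir: String,
                              overwrite: Boolean = false): Unit = {
    val path = new Path(dir)
    val fs = FileSystem.get(rdd.context.hadoopConfiguration)
    if (fs.exists(path)) {
      if (overwrite) fs.delete(path, true) // recursive delete of old output
      else throw new IllegalArgumentException(
        s"Output directory $dir already exists")
    }
    rdd.saveAsTextFile(dir)
  }
}
{code}

Failing fast by default would match Hadoop's FileOutputFormat behavior, while the flag keeps intentional overwrites convenient.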