Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: None
Description
If I call rdd.saveAsTextFile with an existing directory, it will cheerfully and silently overwrite the files in there. This is bad enough if it means I've accidentally blown away the results of a job that might have taken minutes or hours to run. But it's worse if the second job happens to have fewer partitions than the first: in that case, my output directory now contains some "part" files from the earlier job and some "part" files from the later job, and the only way to tell them apart is by timestamp.
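For illustration, here is a minimal sketch of the mixed-output scenario (the output path and data are made up, and this reproduces the behavior as reported here against 0.9.0):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object OverwriteRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("overwrite-repro").setMaster("local"))
    val out = "/tmp/repro-output" // hypothetical path

    // First job writes 4 partitions: part-00000 through part-00003.
    sc.parallelize(1 to 100, 4).saveAsTextFile(out)

    // Second job writes only 2 partitions: part-00000 and part-00001 are
    // silently replaced, while part-00002 and part-00003 from the first
    // job are left behind, mixing the two jobs' output.
    sc.parallelize(1 to 10, 2).saveAsTextFile(out)

    sc.stop()
  }
}
{code}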
I wonder if Spark's saveAsTextFile shouldn't work more like Hadoop MapReduce, which insists that the output directory not exist before the job starts. Similarly, HDFS won't overwrite files by default. Perhaps there could be an optional argument for saveAsTextFile that indicates whether it should delete the existing directory before starting. (I can't see any case where I'd want to write into a directory that already has data in it. Would a mix of output from different jobs ever be desirable?)
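As a sketch of what that could look like, here is a hypothetical wrapper (saveAsTextFileSafely is not an actual Spark API) that refuses to write over an existing directory unless an explicit overwrite flag is passed, using the Hadoop FileSystem API for the existence check:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

object SafeSave {
  // Hypothetical helper mirroring Hadoop MapReduce's rule that the output
  // directory must not already exist. Passing overwrite = true opts in to
  // deleting it first, per the optional argument suggested above.
  def saveAsTextFileSafely[T](rdd: RDD[T], dir: String,
                              overwrite: Boolean = false): Unit = {
    val path = new Path(dir)
    val fs = FileSystem.get(rdd.context.hadoopConfiguration)
    if (fs.exists(path)) {
      if (overwrite) fs.delete(path, true) // recursive delete of old output
      else throw new IllegalArgumentException(
        s"Output directory $dir already exists")
    }
    rdd.saveAsTextFile(dir)
  }
}
{code}

Failing fast by default would match Hadoop's FileOutputFormat behavior, while the flag keeps intentional overwrites convenient.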