[SPARK-1100] saveAsTextFile shouldn't clobber by default


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: Input/Output
    • Labels: None

    Description

      If I call rdd.saveAsTextFile with an existing directory, it will cheerfully and silently overwrite the files in there. This is bad enough if it means I've accidentally blown away the results of a job that might have taken minutes or hours to run. But it's worse if the second job happens to have fewer partitions than the first; in that case, my output directory now contains some "part" files from the earlier job and some from the later job, and the only way to tell them apart is by timestamp.
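
      A minimal sketch of that mixed-output hazard, assuming Spark 0.9.x (before this issue was fixed) and a hypothetical output path /tmp/out:

        import org.apache.spark.{SparkConf, SparkContext}

        object ClobberDemo {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(
              new SparkConf().setAppName("ClobberDemo").setMaster("local[4]"))

            // First job: 4 partitions produce part-00000 .. part-00003.
            sc.parallelize(1 to 100, numSlices = 4).saveAsTextFile("/tmp/out")

            // Second job: only 2 partitions, so it silently rewrites
            // part-00000 and part-00001. part-00002 and part-00003 still
            // hold the first job's data, leaving a directory that mixes
            // output from both runs.
            sc.parallelize(1 to 100, numSlices = 2).saveAsTextFile("/tmp/out")

            sc.stop()
          }
        }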

      I wonder if Spark's saveAsTextFile shouldn't work more like Hadoop MapReduce, which insists that the output directory not exist before the job starts. Similarly, HDFS won't overwrite files by default. Perhaps there could be an optional argument to saveAsTextFile indicating whether it should delete the existing directory before starting. (I can't see any time I'd want to allow writing to an existing directory with data already in it. Would the mix of output from different tasks ever be desirable?)
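
      As a defensive workaround in the caller, one could check the output path up front, mirroring Hadoop MapReduce's output-spec check. This is only a sketch under assumptions: a Hadoop-compatible filesystem, and a hypothetical helper name saveUnlessExists (e.g. pasted into spark-shell):

        import org.apache.hadoop.fs.Path
        import org.apache.spark.SparkContext
        import org.apache.spark.rdd.RDD

        // Hypothetical helper: refuse to write when the output directory
        // already exists, like FileOutputFormat.checkOutputSpecs in MapReduce.
        def saveUnlessExists(sc: SparkContext, rdd: RDD[String], dir: String): Unit = {
          val out = new Path(dir)
          val fs = out.getFileSystem(sc.hadoopConfiguration)
          require(!fs.exists(out), s"Output directory $out already exists")
          rdd.saveAsTextFile(dir)
        }

      (For what it's worth, this is how the issue was resolved: as of the 1.0.0 fix version above, Spark performs the existence check by default, and the spark.hadoop.validateOutputSpecs setting can relax it for jobs that intentionally overwrite.)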


          People

            Assignee: Patrick Wendell (pwendell)
            Reporter: Diana Carroll (dcarroll@cloudera.com)
