Spark / SPARK-7953

Spark should clean up output dir if job fails


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:

      Description

      MR calls abortTask and abortJob on the OutputCommitter to clean up the temporary output directories, but Spark doesn't seem to do that when outputting an RDD to a Hadoop FS.

      For example, PairRDDFunctions.saveAsNewAPIHadoopDataset should call committer.abortTask(hadoopContext) in the finally block inside the writeShard closure, and jobCommitter.abortJob(jobTaskContext, JobStatus.State.FAILED) should be called if the job fails (see the sketch below).

      Additionally, MR removes the output dir if job fails, but Spark doesn't.
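
      A minimal sketch of the proposed cleanup, in Scala against the Hadoop mapreduce API. The writeShard/runJob wrappers and the writeRecords/submitTasks parameters are hypothetical simplifications of what saveAsNewAPIHadoopDataset does, kept only to show where abortTask and abortJob would be called:

          import org.apache.hadoop.mapreduce.{JobStatus, OutputCommitter, TaskAttemptContext}

          // Task side: if writing the shard fails, abort the task attempt so the
          // committer can delete its temporary output instead of leaving it behind.
          def writeShard(hadoopContext: TaskAttemptContext,
                         committer: OutputCommitter,
                         writeRecords: () => Unit): Unit = {
            committer.setupTask(hadoopContext)
            try {
              writeRecords()
              committer.commitTask(hadoopContext)
            } catch {
              case e: Throwable =>
                committer.abortTask(hadoopContext) // proposed task-level cleanup
                throw e
            }
          }

          // Driver side: if the job fails, abort it so the committer (e.g.
          // FileOutputCommitter) can remove the temporary/partial output directory.
          def runJob(jobCommitter: OutputCommitter,
                     jobTaskContext: TaskAttemptContext,
                     submitTasks: () => Unit): Unit = {
            jobCommitter.setupJob(jobTaskContext)
            try {
              submitTasks()
              jobCommitter.commitJob(jobTaskContext)
            } catch {
              case e: Throwable =>
                jobCommitter.abortJob(jobTaskContext, JobStatus.State.FAILED) // proposed job-level cleanup
                throw e
            }
          }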




            People

            • Assignee: Unassigned
            • Reporter: Mohit Sabharwal (mohitsabharwal)

              Dates

              • Created:
              • Updated:
              • Resolved:
