Spark / SPARK-27210

Cleanup incomplete output files in ManifestFileCommitProtocol if task is aborted

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      Unlike HadoopMapReduceCommitProtocol, ManifestFileCommitProtocol does not clean up incomplete output files in either case: when a task is aborted or when a job is aborted.

      HadoopMapReduceCommitProtocol writes intermediate files to a staging directory, so when a job is aborted it can simply delete the staging directory to clean up everything. It also makes an extra effort to clean up intermediate files on the task side when a task is aborted.

      ManifestFileCommitProtocol does no cleanup at all; it only maintains metadata listing the output files that were written completely. It would be better if ManifestFileCommitProtocol made a best effort to clean up. It is unclear whether it can do job-level cleanup, since it does not use a staging directory, but it can clearly make a best effort at task-level cleanup.
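
      The task-level cleanup proposed above can be illustrated with a minimal, language-neutral sketch. This is not Spark's implementation or API: the class `TaskFileTracker` and its methods are hypothetical names chosen for illustration, and the local filesystem stands in for whatever storage the sink writes to. The idea is only that a task remembers every file it starts writing, reports them on commit, and deletes them on abort while ignoring deletion failures.

```python
import os
import tempfile


class TaskFileTracker:
    """Hypothetical helper (not a Spark class): tracks files one task writes."""

    def __init__(self):
        # Paths of every output file this task has started writing.
        self.added_files = []

    def new_task_file(self, directory, name):
        # Record the path before any bytes are written, so that an abort
        # can find the file even if the write fails halfway through.
        path = os.path.join(directory, name)
        self.added_files.append(path)
        return path

    def commit_task(self):
        # On success, hand back the completed files so that the caller
        # can record them in the manifest (metadata) log.
        return list(self.added_files)

    def abort_task(self):
        # Best effort: try to delete each tracked file and swallow
        # failures, since cleanup must not mask the original task error.
        for path in self.added_files:
            try:
                os.remove(path)
            except OSError:
                pass
        self.added_files.clear()


def demo():
    # Simulate a task that writes a partial file and is then aborted.
    with tempfile.TemporaryDirectory() as out_dir:
        tracker = TaskFileTracker()
        path = tracker.new_task_file(out_dir, "part-00000")
        with open(path, "w") as f:
            f.write("partial output")
        tracker.abort_task()
        return os.path.exists(path), tracker.added_files
```

      Calling `demo()` writes a partial file, aborts the task, and reports whether the file still exists along with the tracker's remaining entries. Job-level cleanup is deliberately left out of the sketch, matching the open question in the description.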

            People

              Assignee: Jungtaek Lim (kabhwan)
              Reporter: Jungtaek Lim (kabhwan)
              Votes: 0
              Watchers: 1
