SPARK-8413: DirectParquetOutputCommitter doesn't clean up the file on task failure


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 1.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

      Description

      Here are the steps that lead to the failure.

      1. Write a DataFrame using DirectParquetOutputCommitter.
      2. The first task attempt fails partway through the write, e.g. at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L355
      3. There is no clean-up logic on task failure, so the Parquet part file written by the failed attempt is left half-written (a possible clean-up hook is sketched after the stack trace below).
      4. The second attempt fails with the following exception because the target file already exists.

      2015-06-15T15:37:32.703 WARN [task-result-getter-2] org.apache.spark.scheduler.TaskSetManager - Lost task 56.1 in stage 7.0 (TID 73125, <REDACTED>): java.io.IOException: File already exists:s3://<REDACTED>
      	at <REDACTED>
      	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851)
      	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832)
      	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731)
      	at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
      	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
      	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
      	at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:350)
      	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:371)
      	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:371)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      	at org.apache.spark.scheduler.Task.run(Task.scala:64)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
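
      For illustration, here is a minimal sketch (not the actual Spark class) of the missing clean-up: a direct committer could override abortTask to delete the file the failed attempt was writing, so a retry can recreate it instead of hitting "File already exists". The part-file naming below is an assumption for this sketch, mirroring the "part-r-<taskId>.parquet" convention visible in ParquetTableOperations, not a guaranteed contract.

      import org.apache.hadoop.fs.Path
      import org.apache.hadoop.mapreduce.TaskAttemptContext
      import parquet.hadoop.ParquetOutputCommitter

      // Sketch only: a direct-output committer that removes its own
      // half-written part file when the task attempt is aborted.
      class CleaningDirectOutputCommitter(outputPath: Path, context: TaskAttemptContext)
        extends ParquetOutputCommitter(outputPath, context) {

        // Write straight to the final output location, as
        // DirectParquetOutputCommitter does.
        override def getWorkPath(): Path = outputPath

        override def abortTask(taskContext: TaskAttemptContext): Unit = {
          val fs = outputPath.getFileSystem(taskContext.getConfiguration)
          // Assumed naming: derive this attempt's part file from the task ID.
          val taskId = taskContext.getTaskAttemptID.getTaskID.getId
          val partFile = new Path(outputPath, f"part-r-$taskId%05d.parquet")
          if (fs.exists(partFile)) {
            // Drop the partial file so the next attempt can start clean.
            fs.delete(partFile, false)
          }
        }
      }

      Even with such a hook, deleting in abortTask is only workable because a direct committer writes each task to a deterministic final path; with speculative execution, two live attempts could race on the same file, which is part of why direct committers are generally considered risky on S3. Later 1.x releases allow plugging a custom committer in via the spark.sql.parquet.output.committer.class option; whether the sketch above fits a given setup is left to the reader.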
      


    People

    • Assignee: Unassigned
    • Reporter: Mingyu Kim (mkim)
    • Shepherd: Cheng Lian
    • Votes: 4
    • Watchers: 10
