Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None
    • Target Version/s:

      Description

      When we use DirectParquetOutputCommitter on S3 and speculation is enabled, there is a chance that we can lose data.

      Here is the code to reproduce the problem.

      import org.apache.spark.sql.functions._
      val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: Int, partitionId: Int, attemptNumber: Int) => {
        if (partitionId == 0 && i == 5) {
          if (attemptNumber > 0) {
            Thread.sleep(15000)
            throw new Exception("new exception")
          } else {
            Thread.sleep(10000)
          }
        }
        
        i
      })
      
      val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
        val context = org.apache.spark.TaskContext.get()
        val partitionId = context.partitionId
        val attemptNumber = context.attemptNumber
        iter.map(i => (i, partitionId, attemptNumber))
      }.toDF("i", "partitionId", "attemptNumber")
      
      df
        .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
        .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
      
      sqlContext.read.load("/home/yin/outputCommitter").count
      // The result is 99 and 5 is missing from the output.
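
      The reproduction above assumes speculation is on and the direct Parquet committer is selected. A sketch of the settings assumed (these are the Spark 1.x-era config keys, and the committer's package has moved between 1.x releases, so the class name may need adjusting for your version):

      // spark.speculation must be enabled before the SparkContext starts, e.g. via
      //   --conf spark.speculation=true
      // The direct committer is chosen through the Parquet output committer key:
      sqlContext.setConf(
        "spark.sql.parquet.output.committer.class",
        "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")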
      

      What happens is that the original task finishes first and uploads its output file to S3, and then the speculative task fails. Because we still have to call the output stream's close method, which uploads data to S3, we end up uploading the partial result generated by the failed speculative task, and this file overwrites the correct file generated by the original task.
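
      As an illustrative sketch of the race (local files stand in for S3 objects; the path and contents are made up): a direct committer gives every attempt the same final path, so there is no per-attempt staging directory to throw away when an attempt fails.

      import java.nio.file.{Files, Paths}

      // Attempt 0 (the original task) finishes and "commits" the complete file.
      val finalPath = Paths.get("/tmp/part-00000.parquet")   // stands in for the S3 key
      Files.write(finalPath, "complete output".getBytes)

      // Attempt 1 (the speculative task) writes to the very same path, fails midway,
      // but its stream still has to be closed, which uploads the partial object and
      // clobbers attempt 0's file.
      Files.write(finalPath, "partial ou".getBytes)

      new String(Files.readAllBytes(finalPath))   // "partial ou" -- the good output is gone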

        Issue Links

          Activity

          rxin Reynold Xin added a comment -

          Let's not remove it for now until we have a better alternative.

          stevel@apache.org Steve Loughran added a comment -

          The fault is having speculation turned on, rather than the committer itself. Best to add a way for the system to detect that the output is going to an object store with potential consistency issues, and reject.

          In HADOOP-9545 we've been considering an explicit object store API, one which uses PUT to write stuff, rather than pretend that the output stream is writing stuff and that close() is a low-cost, minimal side-effect operation.

          yhuai Yin Huai added a comment -

          Steve Loughran Seems HADOOP-9545 is not the right jira?

          stevel@apache.org Steve Loughran added a comment -

          sorry! HADOOP-9565

          stevel@apache.org Steve Loughran added a comment -

          I should add that, as the default committer uses rename(), on some object stores (s3n, swift) a client-side copy may be taking place; on s3a a server-side copy happens. After all of these, a recursive delete kicks in. So the FileOutputCommitter is equally prone to race conditions, and uses significantly more IO; rename() is likely to take time O(data) rather than O(1). I'd go for the direct committer if you are planning to use S3 as the direct output of an operation. For speculation, it's better to write to HDFS and then copy afterwards.
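
          A minimal sketch of that write-to-HDFS-then-copy pattern (df here is any DataFrame; the paths and bucket name are illustrative, and it assumes the S3 filesystem and credentials are already configured in the Hadoop configuration):

          // Stage the output on HDFS first, where the rename-based commit is cheap.
          df.write.mode("overwrite").parquet("hdfs:///tmp/staging/output")

          // Then copy the finished result to S3 as a separate, non-speculative step.
          import org.apache.hadoop.fs.{FileUtil, Path}
          val hadoopConf = sc.hadoopConfiguration
          val src = new Path("hdfs:///tmp/staging/output")
          val dst = new Path("s3n://my-bucket/output")   // hypothetical bucket
          FileUtil.copy(src.getFileSystem(hadoopConf), src,
                        dst.getFileSystem(hadoopConf), dst,
                        false /* deleteSource */, hadoopConf)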

          rxin Reynold Xin added a comment -

          Note that this is a problem whenever there are multiple attempts of the same task due to failures, even when speculation is off. Let's remove it in Spark 2.0 so users don't run into corrupted outputs.

          apachespark Apache Spark added a comment -

          User 'rxin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/12229

          stevel@apache.org Steve Loughran added a comment -

          The problem with returning to the classic committer is that it assumes that rename is O(1) & atomic, neither condition holding against s3 or swift. It can fail too, except that the failure window is smaller: the O(output-size) rename phase rather than the whole app.

          Someone (and I suspect it will be me, unless there are other volunteers) will have to do something that works better. Ideally something like direct output with a commit protocol that either works well on an object store with create consistency (as all S3 installations now do), or at least can outsource the consistency requirements to something else (as s3mper does).

          At the very least, it can do recovery of previous failures on startup through some cleanup mechanism.

          mkim Mingyu Kim added a comment -

          Reynold Xin, can you clarify why this is a problem even when speculation is off? I've been using direct output committers without any problem with speculation off. DataFrame knew how to clean up left-over files from failed tasks as long as the two task runs don't overlap (i.e. the retry starts after the former try finishes).

          rxin Reynold Xin added a comment -

          I think Josh et al already replied, but to close the loop: the direct committer is not safe when there is a network partition, e.g. the Spark driver might not be aware of a task that's still running on an executor.

          matt.martin Matt Martin added a comment - edited

          For what it's worth, I gather that the Netflix folks have their own S3-specific solution (based purely on the couple of minutes they spend talking about it in this Hadoop Summit talk: https://youtu.be/85sew9OFaYc?t=8m39s).

          andyd88 Andy Dang added a comment -

          What's the alternative to this? We rely on it for publishing our parquet files to S3, and without it the normal output committer takes forever to rename the files in S3.

          stevel@apache.org Steve Loughran added a comment -

          The solution for this is going to be s3guard, HADOOP-13345, which adds a Dynamo-backed metadata store for atomic/consistent operations, plus, as an added bonus, the ability to skip S3 HTTP calls in getFileStatus(). That'll be the foundation for an output committer that can handle speculative commits and bypass the rename.

          That work is just starting up; I would strongly encourage you to get involved, make sure your needs are represented, and help test it.

          Until then:

          1. Switch your code to using s3a and Hadoop 2.7.2+; it's better all round, gets better in Hadoop 2.8, and is the basis for s3guard.
          2. Use the Hadoop FileOutputCommitter and set mapreduce.fileoutputcommitter.algorithm.version to 2 (see the sketch below).
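
          As an illustration of option 2, one way to apply that setting is through Spark's spark.hadoop.* passthrough, which copies the key into the Hadoop Configuration used when writing. This is only a sketch; the application name is made up:

          import org.apache.spark.{SparkConf, SparkContext}

          // Keys prefixed with spark.hadoop. are copied into the Hadoop Configuration,
          // so FileOutputCommitter picks up the v2 commit algorithm when the job writes.
          val conf = new SparkConf()
            .setAppName("parquet-to-s3")   // hypothetical application name
            .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
          val sc = new SparkContext(conf)

          The same key can also be passed with --conf when submitting the job.
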
          Mayank-Shete Mayank Shete added a comment -

          How can you achieve this in AWS EMR 5.0 while using Spark 2.0?

          stevel@apache.org Steve Loughran added a comment -

          Amazon EMR's S3 support is its own codebase; I'm afraid you'll have to talk to the EMR team there.

          aid129 Ai Deng added a comment -

          You can add this line for your SparkContext; it changes EMR's Hadoop config:

          sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
          
          yhuai Yin Huai added a comment -

          Steve Loughran I took a quick look at hadoop 1 (https://github.com/apache/hadoop/blob/release-1.2.1/src/mapred/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L111) and hadoop 2 (https://github.com/apache/hadoop/blob/branch-2.7.3/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L326). Seems Hadoop 1 actually uses algorithm 2. Is my understanding correct?

          stevel@apache.org Steve Loughran added a comment -

          Looking at the git logs to see which code changed, MAPREDUCE-4815 (https://issues.apache.org/jira/browse/MAPREDUCE-4815) implies that you are correct, but that at some point in Hadoop 0.23 (the one between 1.x and 2.x) the commit algorithm changed and slowed down.

          yhuai Yin Huai added a comment - edited

          Thanks! Seems https://issues.apache.org/jira/browse/MAPREDUCE-2702 introduced the change (diff).

          chiragvaya Chirag Vaya added a comment -

          Mingyu Kim Can you please tell us in what environment (standalone Spark on a single node, multiple nodes, or AWS EMR) you were using the direct output committer? According to Reynold Xin, any environment that can have a network partition (e.g. AWS EMR) would lead to inconsistencies. Please correct me if I am wrong on this.

          stevel@apache.org Steve Loughran added a comment -

          HADOOP-13786 covers adding a committer for direct output to S3, provided S3 adds the ability to fail any PUT which would overwrite an existing blob, that is, an atomic PUT-without-overwrite. Dynamo can add this for AWS S3; other implementations of the API may be able to provide something similar.

          apachespark Apache Spark added a comment -

          User 'rxin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/16796

          apachespark Apache Spark added a comment -

          User 'cloud-fan' has created a pull request for this issue:
          https://github.com/apache/spark/pull/18689

          apachespark Apache Spark added a comment -

          User 'cloud-fan' has created a pull request for this issue:
          https://github.com/apache/spark/pull/18716


            People

            • Assignee:
              rxin Reynold Xin
            • Reporter:
              yhuai Yin Huai
            • Votes:
              0
            • Watchers:
              18

              Dates

              • Created:
              • Updated:
              • Resolved:

                Development