Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-10063

Remove DirectParquetOutputCommitter

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:

      Description

      When we use DirectParquetOutputCommitter on S3 and speculation is enabled, there is a chance that we can loss data.

      Here is the code to reproduce the problem.

      import org.apache.spark.sql.functions._
      val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: Int, partitionId: Int, attemptNumber: Int) => {
        if (partitionId == 0 && i == 5) {
          if (attemptNumber > 0) {
            Thread.sleep(15000)
            throw new Exception("new exception")
          } else {
            Thread.sleep(10000)
          }
        }
        
        i
      })
      
      val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
        val context = org.apache.spark.TaskContext.get()
        val partitionId = context.partitionId
        val attemptNumber = context.attemptNumber
        iter.map(i => (i, partitionId, attemptNumber))
      }.toDF("i", "partitionId", "attemptNumber")
      
      df
        .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
        .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
      
      sqlContext.read.load("/home/yin/outputCommitter").count
      // The result is 99 and 5 is missing from the output.
      

      What happened is that the original task finishes first and uploads its output file to S3, then the speculative task somehow fails. Because we have to call output stream's close method, which uploads data to S3, we actually uploads the partial result generated by the failed speculative task to S3 and this file overwrites the correct file generated by the original task.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                rxin Reynold Xin
                Reporter:
                yhuai Yin Huai
              • Votes:
                0 Vote for this issue
                Watchers:
                19 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: