SPARK-27542

SparkHadoopWriter doesn't call setWorkOutputPath, causing NPEs when using certain legacy OutputFormats

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: Input/Output

    Description

      In Hadoop MapReduce, tasks call FileOutputFormat.setWorkOutputPath() after configuring the output committer: https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611

      Spark doesn't do this: https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115

      As a result, certain legacy output formats can fail to work out-of-the-box on Spark. In particular, org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat can fail with NullPointerExceptions, e.g.

      java.lang.NullPointerException
        at org.apache.hadoop.fs.Path.<init>(Path.java:105)
        at org.apache.hadoop.fs.Path.<init>(Path.java:94)
        at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
      [...]
        at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
      

      It looks like someone on GitHub has hit the same problem: https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe

      Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348

      We might be able to fix this by having Spark mimic Hadoop's logic, i.e. calling FileOutputFormat.setWorkOutputPath() once the output committer has been configured. I'm unsure whether that change would pose compatibility risks for other existing workloads, though.
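
A minimal sketch of what mimicking Hadoop's logic might look like, modeled on the Task.java code linked above. The names WorkOutputPathSketch and setWorkOutputPathLikeHadoop are hypothetical and not part of Spark. Where such a call would belong inside SparkHadoopWriter, and whether it is safe for existing workloads, is not established here. It assumes the old-API (org.apache.hadoop.mapred) classes are on the classpath.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf, OutputCommitter, TaskAttemptContext}

// Hypothetical helper for illustration only; not part of Spark or Hadoop.
object WorkOutputPathSketch {

  /**
   * Sets the task's work output path on `conf`, following the same
   * branching that Hadoop's Task.java appears to use after the output
   * committer has been configured.
   */
  def setWorkOutputPathLikeHadoop(
      conf: JobConf,
      committer: OutputCommitter,
      taskContext: TaskAttemptContext): Unit = {
    // The job-level output directory; null if the job writes no files.
    val outputPath: Path = FileOutputFormat.getOutputPath(conf)
    if (outputPath != null) {
      committer match {
        case fileCommitter: FileOutputCommitter =>
          // File-based committers write into a per-attempt directory.
          FileOutputFormat.setWorkOutputPath(
            conf, fileCommitter.getTaskAttemptPath(taskContext))
        case _ =>
          // Any other committer: fall back to the final output directory.
          FileOutputFormat.setWorkOutputPath(conf, outputPath)
      }
    }
  }
}
```

With the work output path set, FileOutputFormat.getWorkOutputPath(conf) returns a non-null path. Judging by the stack trace above, that is the value DeprecatedParquetOutputFormat.getDefaultWorkFile needs in order to build its file path without a NullPointerException.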

          People

            Assignee: Unassigned
            Reporter: Josh Rosen (joshrosen)