[SPARK-27542] SparkHadoopWriter doesn't call setWorkOutputPath, causing NPEs when using certain legacy OutputFormats

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels:

      Description

      In Hadoop MapReduce, each task calls FileOutputFormat.setWorkOutputPath() after configuring the output committer: https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611

      Spark doesn't do this: https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115

      As a result, certain legacy output formats can fail to work out-of-the-box on Spark. In particular, org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat can fail with NullPointerExceptions, e.g.

      java.lang.NullPointerException
        at org.apache.hadoop.fs.Path.<init>(Path.java:105)
        at org.apache.hadoop.fs.Path.<init>(Path.java:94)
        at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
      [...]
        at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
      

      It looks like someone on GitHub has hit the same problem: https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe

      Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348

      We might be able to fix this by having Spark mimic Hadoop's logic. I'm unsure whether that change would pose compatibility risks for other existing workloads, though.
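
      To make the failure mode and the proposed fix concrete, here is a minimal, dependency-free sketch. It is not Hadoop or Spark code: a plain Map stands in for JobConf, and the class name, method names, configuration keys, and the "_temporary" directory layout are all hypothetical. It also assumes that the work path depends on whether the committer is file-based, which should be checked against the linked Task.java before relying on it.

```java
import java.util.Map;
import java.util.Objects;

// Dependency-free model of the issue; a plain Map stands in for Hadoop's JobConf.
// All names and keys below are hypothetical and exist only for this sketch.
final class WorkOutputPathSketch {
    static final String OUTPUT_DIR = "sketch.output.dir";
    static final String WORK_OUTPUT_DIR = "sketch.work.output.dir";

    // Models a legacy OutputFormat that resolves its file against the work output path.
    // Throws NullPointerException when the work output path was never set.
    static String defaultWorkFile(Map<String, String> conf, String fileName) {
        String workDir = Objects.requireNonNull(
            conf.get(WORK_OUTPUT_DIR), "work output path is not set");
        return workDir + "/" + fileName;
    }

    // Models the task-side step: once the committer is known, record the work output path.
    static void setWorkOutputPath(Map<String, String> conf,
                                  boolean fileBasedCommitter,
                                  String taskAttemptId) {
        String outputDir = conf.get(OUTPUT_DIR);
        if (outputDir == null) {
            return; // no file output path configured, so there is nothing to derive
        }
        // Assumed layout: file-based committers write into a per-attempt directory,
        // any other committer writes directly into the output directory.
        String workDir = fileBasedCommitter
            ? outputDir + "/_temporary/" + taskAttemptId
            : outputDir;
        conf.put(WORK_OUTPUT_DIR, workDir);
    }
}
```

      With only the output directory configured, defaultWorkFile throws a NullPointerException, which is the shape of the failure in the stack trace above. Calling setWorkOutputPath first makes the same lookup succeed. The early return for jobs without an output directory is what should keep a change like this from affecting workloads that do not write through a file-based OutputFormat, although that is exactly the compatibility question that would need testing.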

            People

            • Assignee: Unassigned
            • Reporter: Josh Rosen (joshrosen)