Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.4.0
- Fix Version/s: None
Description
In Hadoop MapReduce, tasks call FileOutputFormat.setWorkOutputPath() after configuring the output committer: https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611
Spark doesn't do this: https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115
As a result, certain legacy output formats written against the old org.apache.hadoop.mapred API can fail to work out of the box on Spark. In particular, org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat can fail with a NullPointerException, e.g.:
java.lang.NullPointerException
    at org.apache.hadoop.fs.Path.<init>(Path.java:105)
    at org.apache.hadoop.fs.Path.<init>(Path.java:94)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
    [...]
    at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
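The NPE appears to come from the work output path never being set on the executor's JobConf: DeprecatedParquetOutputFormat.getDefaultWorkFile builds a Path relative to FileOutputFormat.getWorkOutputPath(conf), which returns null when the task-output-directory property is absent, and Path's (parent, child) constructor rejects a null parent. A minimal illustration of that mechanism (not a full Spark reproduction; the file name is made up):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.{FileOutputFormat, JobConf}

    // A fresh JobConf with no task output directory set -- the state the conf is in
    // on a Spark executor, since SparkHadoopWriter never sets it.
    val conf = new JobConf()
    val workPath = FileOutputFormat.getWorkOutputPath(conf)  // returns null
    // DeprecatedParquetOutputFormat effectively does this when resolving its work file;
    // Path(parent, child) throws NullPointerException for a null parent.
    new Path(workPath, "part-00000.parquet")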
It looks like someone on GitHub has hit the same problem: https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe
Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348
We might be able to fix this by having Spark mimic Hadoop's logic and set the work output path after setting up the committer (see the sketch below). I'm unsure whether that change would pose compatibility risks for other existing workloads, though.
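One possible shape for such a change, sketched purely as an illustration: in the old-API write path, after the committer has been set up for the task, publish the committer's task-attempt directory under the property that FileOutputFormat.getWorkOutputPath() reads. The helper name below is made up, workDir stands in for whatever task-attempt directory the configured committer uses (Hadoop's own Task obtains it from FileOutputCommitter.getTaskAttemptPath(), which is not public API), and the property name mirrors what FileOutputFormat.setWorkOutputPath() sets.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.JobConf

    // Hypothetical helper, not existing Spark code: mimic Hadoop's Task.initialize()
    // by exposing the task's work directory to old-API output formats.
    // `workDir` is assumed to be the committer's task-attempt directory, which Hadoop's
    // own Task obtains from FileOutputCommitter.getTaskAttemptPath(taskContext).
    def setWorkOutputPathForTask(conf: JobConf, workDir: Path): Unit = {
      // The property that FileOutputFormat.setWorkOutputPath()/getWorkOutputPath() use
      // ("mapreduce.task.output.dir"; older Hadoop releases spell it "mapred.work.output.dir").
      conf.set("mapreduce.task.output.dir", workDir.toString)
    }

With that property in place, DeprecatedParquetOutputFormat.getDefaultWorkFile() would see a non-null work output path and the NullPointerException above should go away; the open question is whether unconditionally setting it could change behavior for jobs that already manage this property themselves.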