Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4.0, 3.4.1, 3.5.0, 3.5.1, 4.0.0
Fix Versions: None

Environment

Ran on Spark 3.3.1 / EMR 6.10.0 with an r5.xlarge driver and 4 x r5.16xlarge core nodes. The workload was:

spark.read.parquet("<redacted HDFS location>")
  .repartition(100)
  .write.format("com.databricks.spark.csv")
  .option("compression", "gzip")
  .option("header", "true")
  .option("encoding", "utf-8")
  .option("charset", "utf-8")
  .option("escape", "")
  .option("quote", "")
  .option("quote", "\u0000")
  .option("emptyValue", "")
  .option("delimiter", "\t")
  .mode("overwrite")
  .save("<redacted HDFS location>")

Input data contained 5 Parquet data files of 41 MB each. Most of the fields were null. The schema was very wide (1099 columns).
Description
https://github.com/apache/spark/pull/36110/files introduced a SQLConf access inside a critical section, executed for every null field processed in a record.
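For illustration, the problematic shape is roughly the following. This is a paraphrase, not the exact UnivocityGenerator code, and the conf entry name (LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV) is my reading of the linked PR:

// Paraphrased sketch of the regression, not verbatim Spark code.
// Assumption: the conf touched by the PR is
// SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.internal.SQLConf

object PerFieldConfAccess {
  def convertField(row: InternalRow, i: Int, nullValue: String): String = {
    if (row.isNullAt(i)) {
      // SQLConf.get resolves the active session conf on every call.
      // Doing this for every null field of every row puts a conf lookup
      // in the writer's hot path.
      if (nullValue.isEmpty &&
          !SQLConf.get.getConf(SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV)) {
        null
      } else {
        nullValue
      }
    } else {
      row.getString(i)
    }
  }
}

With a 1099-column schema that is mostly nulls, that lookup runs over a thousand times per row written.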
This causes a severe performance degradation: a workload that previously completed in a couple of seconds now takes around 8 minutes.
This conf access needs to be moved out of the critical path; there is no need for it to be in this location.
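A minimal sketch of that direction, assuming the conf still needs to be honored: read it once when the writer is constructed and cache the result, so the per-field path only touches a local boolean (NullValueFormatter is a hypothetical stand-in for the generator class):

// Sketch of the proposed direction, not the actual patch: hoist the
// conf read out of the per-field path.
import org.apache.spark.sql.internal.SQLConf

class NullValueFormatter(nullValue: String) {
  // One conf lookup per writer instance instead of one per null field.
  private val nullAsQuotedEmptyString: Boolean =
    SQLConf.get.getConf(SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV)

  def formatNull(): String =
    if (nullValue.isEmpty && !nullAsQuotedEmptyString) null else nullValue
}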
The version of Spark prior to this commit didn't exhibit the slowdown. I also patched an affected version to remove the suspected line, and the problem went away.