Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Versions: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4.0, 3.4.1, 3.5.0, 3.5.1, 4.0.0
Fix Versions: None

Environment

Ran on Spark 3.3.1 / EMR 6.10.0 with an r5.xlarge driver and 4 x r5.16xlarge core nodes. The workload was:

spark.read.parquet("<redacted HDFS location>")
  .repartition(100)
  .write.format("com.databricks.spark.csv")
  .option("compression", "gzip")
  .option("header", "true")
  .option("encoding", "utf-8")
  .option("charset", "utf-8")
  .option("escape", "")
  .option("quote", "")
  .option("quote", "\u0000")
  .option("emptyValue", "")
  .option("delimiter", "\t")
  .mode("overwrite")
  .save("<redacted HDFS location>")

Input data contained 5 Parquet data files of 41 MB each. Most of the fields were null. The schema was very wide (1099 columns).
Description
https://github.com/apache/spark/pull/36110/files introduced a SQLConf access inside a critical section, executed for every null field processed in a record.
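For illustration, the problematic shape is roughly the following. This is a paraphrase, not the exact UnivocityGenerator code, and the conf entry name (LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV) is my reading of the linked PR:

// Paraphrased sketch of the regression, not verbatim Spark code.
// Assumption: the conf touched by the PR is
// SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.internal.SQLConf

object PerFieldConfAccess {
  def convertField(row: InternalRow, i: Int, nullValue: String): String = {
    if (row.isNullAt(i)) {
      // SQLConf.get resolves the active session conf on every call.
      // Doing this for every null field of every row puts a conf lookup
      // in the writer's hot path.
      if (nullValue.isEmpty &&
          !SQLConf.get.getConf(SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV)) {
        null
      } else {
        nullValue
      }
    } else {
      row.getString(i)
    }
  }
}

With a 1099-column schema that is mostly nulls, that lookup runs over a thousand times per row written.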
This causes a severe performance degradation: a workload that previously completed in a couple of seconds now takes around 8 minutes.
This conf access needs to be moved out of the critical path; there is no need for it to be in this location.
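A minimal sketch of that direction, assuming the conf still needs to be honored: read it once when the writer is constructed and cache the result, so the per-field path only touches a local boolean (NullValueFormatter is a hypothetical stand-in for the generator class):

// Sketch of the proposed direction, not the actual patch: hoist the
// conf read out of the per-field path.
import org.apache.spark.sql.internal.SQLConf

class NullValueFormatter(nullValue: String) {
  // One conf lookup per writer instance instead of one per null field.
  private val nullAsQuotedEmptyString: Boolean =
    SQLConf.get.getConf(SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV)

  def formatNull(): String =
    if (nullValue.isEmpty && !nullAsQuotedEmptyString) null else nullValue
}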
The version of Spark prior to this commit didn't exhibit the slowdown. I also patched an affected version to remove the suspected line, and the problem went away.