Description
Because `BinaryArithmetic#dataType` recursively recomputes the data type of every child node on each call, the driver becomes very slow when an expression combines many columns.
For example, the following code:
```
import spark.implicits._
import scala.util.Random

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{expr, sum}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val N = 30
val M = 100

val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
val schema = StructType(columns.map(StructField(_, IntegerType)))
val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
val df = spark.createDataFrame(rdd, schema)
val colExprs = columns.map(sum(_))

// Generate a new column that adds up the other 30 columns.
df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```
This code takes a few minutes to execute on the driver in Spark 3.4, but only a few seconds in Spark 3.2. Related issue: SPARK-39316
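To make the cost concrete, here is a minimal sketch of the pattern. The `Expr`, `Leaf`, `Add`, and `checkAll` names below are simplified stand-ins invented for illustration, not Spark's Catalyst classes: an uncached `dataType` re-walks its entire subtree on every call, and analysis asks every node of the left-deep `a + b + c + ...` chain for its type, so total work grows quadratically with the number of columns.

```
// Simplified stand-ins for Catalyst expressions (illustrative only).
// Each dataType call re-walks its whole subtree, mirroring an uncached
// BinaryArithmetic#dataType.
var calls = 0

sealed trait Expr {
  def dataType: String
  def children: Seq[Expr]
}

case class Leaf(name: String) extends Expr {
  def dataType: String = { calls += 1; "int" }
  def children: Seq[Expr] = Nil
}

case class Add(left: Expr, right: Expr) extends Expr {
  // No caching: every call recomputes both children's types recursively.
  def dataType: String = {
    calls += 1
    require(left.dataType == right.dataType)
    left.dataType
  }
  def children: Seq[Expr] = Seq(left, right)
}

// Analysis rules ask every node in the plan for its type.
def checkAll(e: Expr): Unit = { e.dataType; e.children.foreach(checkAll) }

// "a + b + c + ..." parses to a left-deep chain of Adds.
val n = 1000
val chain = (1 to n).map(i => Leaf(s"c$i"): Expr).reduce(Add(_, _))
checkAll(chain)
println(s"dataType evaluations for $n columns: $calls") // grows ~ n^2
```

Caching the computed type on each node (for instance via a `lazy val`, which is the general shape of the fix to `BinaryArithmetic#dataType`) would reduce the same `checkAll` pass to one evaluation per node, i.e. linear in the number of columns.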
Issue Links
- fixes
  - SPARK-44912 Spark 3.4 multi-column sum slows with many columns (Resolved)
- is duplicated by
  - SPARK-44912 Spark 3.4 multi-column sum slows with many columns (Resolved)
  - SPARK-45745 Extremely slow execution of sum of columns in Spark 3.4.1 (Resolved)
- links to