Description
Because `BinaryArithmetic#dataType` recursively recomputes the data type of every child node on each call, the driver becomes very slow when an expression combines many columns.
For example, the following code:
```
import spark.implicits._
import scala.util.Random

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{expr, sum}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val N = 30
val M = 100

val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
val schema = StructType(columns.map(StructField(_, IntegerType)))
val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
val df = spark.createDataFrame(rdd, schema)
val colExprs = columns.map(sum(_))

// Generate a new column that adds up the other 30 columns.
df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
```
This code takes a few minutes to execute on the driver in Spark 3.4, but only a few seconds in Spark 3.2. Related issue: SPARK-39316
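To make the cost concrete, here is a minimal sketch of the pattern. The `Expr`, `Leaf`, `Add`, and `checkAll` names below are simplified stand-ins invented for illustration, not Spark's Catalyst classes: an uncached `dataType` re-walks its entire subtree on every call, and analysis asks every node of the left-deep `a + b + c + ...` chain for its type, so total work grows quadratically with the number of columns.

```
// Simplified stand-ins for Catalyst expressions (illustrative only).
// Each dataType call re-walks its whole subtree, mirroring an uncached
// BinaryArithmetic#dataType.
var calls = 0

sealed trait Expr {
  def dataType: String
  def children: Seq[Expr]
}

case class Leaf(name: String) extends Expr {
  def dataType: String = { calls += 1; "int" }
  def children: Seq[Expr] = Nil
}

case class Add(left: Expr, right: Expr) extends Expr {
  // No caching: every call recomputes both children's types recursively.
  def dataType: String = {
    calls += 1
    require(left.dataType == right.dataType)
    left.dataType
  }
  def children: Seq[Expr] = Seq(left, right)
}

// Analysis rules ask every node in the plan for its type.
def checkAll(e: Expr): Unit = { e.dataType; e.children.foreach(checkAll) }

// "a + b + c + ..." parses to a left-deep chain of Adds.
val n = 1000
val chain = (1 to n).map(i => Leaf(s"c$i"): Expr).reduce(Add(_, _))
checkAll(chain)
println(s"dataType evaluations for $n columns: $calls") // grows ~ n^2
```

Caching the computed type on each node (for instance via a `lazy val`, which is the general shape of the fix to `BinaryArithmetic#dataType`) would reduce the same `checkAll` pass to one evaluation per node, i.e. linear in the number of columns.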
Issue Links
- fixes
  - SPARK-44912 Spark 3.4 multi-column sum slows with many columns (Resolved)
- is duplicated by
  - SPARK-44912 Spark 3.4 multi-column sum slows with many columns (Resolved)
  - SPARK-45745 Extremely slow execution of sum of columns in Spark 3.4.1 (Resolved)
- links to