Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.4.0, 3.4.1
- Fix Version/s: None
- Component/s: None
Description
The code below is a minimal reproducible example of an issue I discovered in PySpark 3.4.x. I want to sum the values of multiple columns and put the per-row sum of those columns into a new column. This code works and returns in a reasonable amount of time in PySpark 3.3.x, but becomes extremely slow in PySpark 3.4.x as the number of columns grows. See below for an execution timing summary as N varies.
```python
import random
import string
from functools import reduce
from operator import add

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# generate a dataframe of N columns by M rows with random 8-character column
# names and random integers in [-5, 10]
N = 30
M = 100
columns = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(N)]
data = [tuple([random.randint(-5, 10) for _ in range(N)]) for _ in range(M)]
df = spark.sparkContext.parallelize(data).toDF(columns)

# 3 ways to add a sum column, all of them slow for high N in Spark 3.4
df = df.withColumn("col_sum1", sum(df[col] for col in columns))
df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
df = df.withColumn("col_sum3", F.expr("+".join(columns)))
```
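The timing tables below do not include the measurement code; a minimal sketch of one way such numbers could be gathered is shown here. The `time.perf_counter` harness and the final `collect()` are assumptions on my part, not part of the original report.

```python
import time

# Hypothetical harness (assumption, not in the original report): wrap the three
# withColumn calls from the repro above and report wall-clock time for a given N.
start = time.perf_counter()
df = df.withColumn("col_sum1", sum(df[col] for col in columns))
df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
df = df.withColumn("col_sum3", F.expr("+".join(columns)))
df.collect()  # force execution; on 3.4.x most of the time appears to be spent
              # analyzing the withColumn expressions, even before collect()
elapsed = time.perf_counter() - start
print(f"N={N}: {elapsed:.3f} s")
```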
Timing results for Spark 3.3:
N | Execution Time (s)
---|---
5 | 0.514
10 | 0.248
15 | 0.327
20 | 0.403
25 | 0.279
30 | 0.322
50 | 0.430
Timing results for Spark 3.4:
N | Execution Time (s)
---|---
5 | 0.379
10 | 0.318
15 | 0.405
20 | 1.32
25 | 28.8
30 | 448
50 | >10000 (did not finish)
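Not part of the original report, but a hedged workaround sketch: the slowdown comes from analyzing a deeply nested chain of `Add` expressions (see the linked `BinaryArithmetic#dataType` issue), so folding the columns at runtime with the higher-order `aggregate` function over an array avoids building that deep expression tree. The cast of the initial accumulator to `bigint` is an assumption to match the column type produced by `random.randint`.

```python
import pyspark.sql.functions as F

# Hedged workaround sketch (not from the original report): sum the columns by
# folding an array at runtime instead of nesting N-1 Add expressions in the plan.
df = df.withColumn(
    "col_sum4",
    F.aggregate(
        F.array(*[F.col(c) for c in columns]),
        F.lit(0).cast("bigint"),  # accumulator type must match the merge result
        lambda acc, x: acc + x,
    ),
)
```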
Issue Links
- duplicates
  - SPARK-45071 Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data (Resolved)
- is fixed by
  - SPARK-45071 Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data (Resolved)