Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.4.0, 3.4.1
- Fix Version/s: None
- Component/s: None
Description
The code below is a minimal reproducible example of an issue I discovered in PySpark 3.4.x. I want to sum the values of multiple columns and put the per-row sum of those columns into a new column. This code works and returns in a reasonable amount of time in PySpark 3.3.x, but becomes extremely slow in PySpark 3.4.x as the number of columns grows. See below for an execution timing summary as N varies.
```python
import random
import string
from functools import reduce
from operator import add

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# generate a dataframe of N columns by M rows with random 8-character column
# names and random integers in [-5, 10]
N = 30
M = 100
columns = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(N)]
data = [tuple([random.randint(-5, 10) for _ in range(N)]) for _ in range(M)]
df = spark.sparkContext.parallelize(data).toDF(columns)

# 3 ways to add a sum column, all of them slow for high N in Spark 3.4
df = df.withColumn("col_sum1", sum(df[col] for col in columns))
df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
df = df.withColumn("col_sum3", F.expr("+".join(columns)))
```
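The timing tables below do not include the measurement code; a minimal sketch of one way such numbers could be gathered is shown here. The `time.perf_counter` harness and the final `collect()` are assumptions on my part, not part of the original report.

```python
import time

# Hypothetical harness (assumption, not in the original report): wrap the three
# withColumn calls from the repro above and report wall-clock time for a given N.
start = time.perf_counter()
df = df.withColumn("col_sum1", sum(df[col] for col in columns))
df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
df = df.withColumn("col_sum3", F.expr("+".join(columns)))
df.collect()  # force execution; on 3.4.x most of the time appears to be spent
              # analyzing the withColumn expressions, even before collect()
elapsed = time.perf_counter() - start
print(f"N={N}: {elapsed:.3f} s")
```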
Timing results for Spark 3.3:
N | Execution Time (s)
---|---
5 | 0.514
10 | 0.248
15 | 0.327
20 | 0.403
25 | 0.279
30 | 0.322
50 | 0.430
Timing results for Spark 3.4:
N | Execution Time (s)
---|---
5 | 0.379
10 | 0.318
15 | 0.405
20 | 1.32
25 | 28.8
30 | 448
50 | >10000 (did not finish)
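Not part of the original report, but a hedged workaround sketch: the slowdown comes from analyzing a deeply nested chain of `Add` expressions (see the linked `BinaryArithmetic#dataType` issue), so folding the columns at runtime with the higher-order `aggregate` function over an array avoids building that deep expression tree. The cast of the initial accumulator to `bigint` is an assumption to match the column type produced by `random.randint`.

```python
import pyspark.sql.functions as F

# Hedged workaround sketch (not from the original report): sum the columns by
# folding an array at runtime instead of nesting N-1 Add expressions in the plan.
df = df.withColumn(
    "col_sum4",
    F.aggregate(
        F.array(*[F.col(c) for c in columns]),
        F.lit(0).cast("bigint"),  # accumulator type must match the merge result
        lambda acc, x: acc + x,
    ),
)
```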
Issue Links
- duplicates
  - SPARK-45071 Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data (Resolved)
- is fixed by
  - SPARK-45071 Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data (Resolved)