[SPARK-45745] Extremely slow execution of sum of columns in Spark 3.4.1 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.4.1
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

We are upgrading some PySpark jobs from Spark 3.1.2 to Spark 3.4.1, and some code that used to run fine now effectively never finishes, even for small DataFrames.

We have simplified the problematic code; the minimal PySpark example below reproduces the issue:

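from pyspark.sql import functions as F

# `sql_context` is assumed to be an existing SQLContext (a SparkSession works the same way).
n_cols = 50
# Five identical rows with columns col0..col49, where col{i} holds the value i.
data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
df_data = sql_context.createDataFrame(data)

# Python's built-in sum() chains Column.__add__, producing one deeply nested
# expression: (((0 + col0) + col1) + ...) + col49.
df_data = df_data.withColumn(
    "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
)
df_data.show(10, False)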

With Spark 3.1.2 this code runs fine, but with 3.4.1 the computation time seems to explode once `n_cols` exceeds about 25. A colleague suggested it could be related to the 22-element limit on tuples in Scala 2.13 (https://www.scala-lang.org/api/current/scala/Tuple22.html), since 25 columns is suspiciously close to that limit. Is there a known defect in the logical plan optimization in 3.4.1? Or is this kind of operation (summing multiple columns) supposed to be implemented differently?
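On the last question, one possible way to write the sum differently is to add the columns pairwise, so the `+` expression is a balanced tree rather than a left-deep chain. The sketch below is not a confirmed workaround. It assumes the slowdown grows with the nesting depth of the arithmetic expression, which the title of the linked SPARK-45071 suggests but this report does not verify. `balanced_sum` and `Node` are hypothetical names introduced here; `Node` is only a stand-in for a Column, so the depth difference can be seen without a Spark session.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def balanced_sum(terms: Sequence[T]) -> T:
    """Add terms pairwise so the expression tree has depth about log2(n).

    Works for anything that supports `+`, which includes PySpark Column objects.
    """
    if not terms:
        raise ValueError("balanced_sum() needs at least one term")
    level = list(terms)
    while len(level) > 1:
        # Combine neighbours; an odd trailing element is carried up unchanged.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0]


class Node:
    """Hypothetical stand-in for a Column: records only expression depth."""

    def __init__(self, depth: int = 0) -> None:
        self.depth = depth

    def __add__(self, other: "Node") -> "Node":
        return Node(max(self.depth, other.depth) + 1)

    def __radd__(self, other: object) -> "Node":
        # Lets the built-in sum() start from the integer 0, as it does with Columns.
        return Node(self.depth + 1)


if __name__ == "__main__":
    leaves = [Node() for _ in range(50)]
    print("built-in sum depth:", sum(leaves).depth)
    print("balanced_sum depth:", balanced_sum(leaves).depth)
```

In the example from this report, the call would become `df_data.withColumn("col_sum", balanced_sum([F.col(f"col{i}") for i in range(n_cols)]))`. The result is the same sum, but the nesting depth grows logarithmically with the number of columns instead of linearly.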

Issue Links

duplicates: SPARK-45071 Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data (Resolved)

People

Assignee: Unassigned

Reporter: Javier

Votes: 0

Watchers: 2

Dates

Created: 31/Oct/23 21:59

Updated: 28/Aug/24 14:53

Resolved: 28/Aug/24 14:53