Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-44912

Spark 3.4 multi-column sum slows with many columns

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0, 3.4.1
    • None
    • PySpark
    • None

    Description

      The code below is a minimal reproducible example of an issue I discovered with Pyspark 3.4.x. I want to sum the values of multiple columns and put the sum of those columns (per row) into a new column. This code works and returns in a reasonable amount of time in Pyspark 3.3.x, but is extremely slow in Pyspark 3.4.x when the number of columns grows. See below for execution timing summary as N varies.

      import pyspark.sql.functions as F
      import random
      import string
      from functools import reduce
      from operator import add
      from pyspark.sql import SparkSession
      
      spark = SparkSession.builder.getOrCreate()
      
      # generate a dataframe N columns by M rows with random 8 digit column 
      # names and random integers in [-5,10]
      N = 30
      M = 100
      columns = [''.join(random.choices(string.ascii_uppercase +
                                        string.digits, k=8))
                 for _ in range(N)]
      data = [tuple([random.randint(-5,10) for _ in range(N)])
              for _ in range(M)]
      
      df = spark.sparkContext.parallelize(data).toDF(columns)
      # 3 ways to add a sum column, all of them slow for high N in spark 3.4
      df = df.withColumn("col_sum1", sum(df[col] for col in columns))
      df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
      df = df.withColumn("col_sum3", F.expr("+".join(columns))) 

      Timing results for Spark 3.3:

      N Exe Time (s)
      5 0.514
      10 0.248
      15 0.327
      20 0.403
      25 0.279
      30 0.322
      50 0.430

      Timing results for Spark 3.4:

      N Exe Time (s)
      5 0.379
      10 0.318
      15 0.405
      20 1.32
      25 28.8
      30 448
      50 >10000 (did not finish)

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              brbickel Brady Bickel
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: