Spark / SPARK-45071

Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Versions: 3.4.0, 3.5.0
    • Fix Versions: 3.4.2, 4.0.0, 3.5.1
    • Component: SQL
    • Labels: None

    Description

      Since `BinaryArithmetic#dataType` recursively recomputes the data type of every child node, the driver becomes very slow when an expression references many columns.

      For example, the following code:

      ```scala
          import scala.util.Random
          import org.apache.spark.sql.Row
          import org.apache.spark.sql.functions.{expr, sum}
          import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
          import spark.implicits._

          val N = 30
          val M = 100
          val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
          val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
          val schema = StructType(columns.map(StructField(_, IntegerType)))
          val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
          val df = spark.createDataFrame(rdd, schema)
          val colExprs = columns.map(sum(_))
          // generate a new column that adds up the other 30 columns
          df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
      ```
      In Spark 3.4 this code takes the driver a few minutes to execute, while in Spark 3.2 it takes only a few seconds. Related issue: SPARK-39316
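      To see why the recursion gets expensive, here is a minimal, self-contained sketch in plain Scala (not Spark's actual classes — `Expr`, `Add`, and `Leaf` are hypothetical stand-ins for Catalyst expressions). A left-deep chain `c1 + c2 + ... + cn` has n-1 binary nodes; if `dataType` re-derives both child types on every call, and analysis asks every node for its type once, the total number of `dataType` invocations grows quadratically in n. Caching the computed type per node (along the lines of what the fix for this issue does) would bring this back to one computation per node.

      ```scala
      object RecursiveCost {
        var calls = 0

        sealed trait Expr { def dataType: String }

        final case class Leaf(name: String) extends Expr {
          def dataType: String = { calls += 1; "int" }
        }

        // Unoptimized: re-derives both child types on every invocation,
        // mimicking the pre-fix BinaryArithmetic#dataType behavior.
        final case class Add(l: Expr, r: Expr) extends Expr {
          def dataType: String = {
            calls += 1
            val t = l.dataType            // recurses into the whole left subtree
            require(t == r.dataType, "type mismatch")
            t
          }
        }

        // Build a left-deep chain c1 + c2 + ... + cn, as produced by
        // expr(columns.mkString(" + ")) in the reproduction above.
        def chain(n: Int): Expr =
          (2 to n).foldLeft(Leaf("c1"): Expr)((e, i) => Add(e, Leaf(s"c$i")))

        def nodes(e: Expr): List[Expr] = e match {
          case a @ Add(l, r) => a :: nodes(l) ::: nodes(r)
          case l: Leaf       => List(l)
        }

        // Analysis touches every node once, asking each for its type;
        // without caching the total cost is n^2 + n - 1 calls.
        def totalCalls(n: Int): Int = {
          calls = 0
          nodes(chain(n)).foreach(_.dataType)
          calls
        }
      }
      ```

      For n = 30 columns this model performs 929 `dataType` calls instead of the 59 (one per node) that a cached type would need; with deeper or wider expressions the gap widens quadratically, which matches the minutes-vs-seconds difference described above.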

            People

              Assignee: Zing zzzzming95
              Reporter: Zing zzzzming95
