Spark / SPARK-40311

Introduce withColumnsRenamed


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.3, 3.1.3, 3.2.2, 3.3.0
    • Fix Version/s: 3.4.0
    • Component/s: PySpark, SparkR, SQL
    • Labels: None
    • Add withColumnsRenamed to Scala and PySpark API

    Description

      Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a single command (see the sketch after the list below). Users run into problems when they rename columns iteratively with `withColumnRenamed`:

      • Even when it works, performance degrades, since every call adds another projection on top of the logical plan
      • In some cases a StackOverflowError is raised because the logical plan grows too big
      • In a few cases the driver dies from memory consumption
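
      A minimal sketch of the proposed PySpark API, assuming it takes a dict mapping each existing column name to its new name (the shape that eventually shipped in 3.4.0):

      # Hypothetical usage of the proposed API: one call, one projection,
      # instead of one withColumnRenamed call (and one projection) per column.
      df = df.withColumnsRenamed({"id": "user_id", "ts": "event_time"})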

      Some reproducible benchmarks. First, the iterative `withColumnRenamed` approach (six rounds of renaming 100 columns):

      import datetime

      import numpy as np
      import pandas as pd

      # Assumes a running SparkSession named `spark` (e.g. the PySpark shell).
      num_rows = 2
      num_columns = 100
      data = np.zeros((num_rows, num_columns))
      columns = [str(i) for i in range(num_columns)]
      raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
      
      # Six rounds of renaming every column, one withColumnRenamed call per
      # column (600 calls in total); only the end-to-end time is reported.
      a = datetime.datetime.now()
      for _ in range(6):
          for col in raw.columns:
              raw = raw.withColumnRenamed(col, f"prefix_{col}")
      g = datetime.datetime.now()
      g - a
      # => datetime.timedelta(seconds=12, microseconds=480021)
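
      The cost comes from plan depth: every `withColumnRenamed` call wraps the previous plan in another projection, so the lineage above ends up roughly 600 projections deep. One way to see this (output elided here):

      # Print the (very deep) parsed/analyzed/optimized plans.
      raw.explain(extended=True)

      With a single rename call per round (invoked through the JVM directly, since no Python wrapper existed yet), the same six rounds finish in well under a second: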
      import datetime

      import numpy as np
      import pandas as pd

      from pyspark.sql import DataFrame

      num_rows = 2
      num_columns = 100
      data = np.zeros((num_rows, num_columns))
      columns = [str(i) for i in range(num_columns)]
      raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

      # Six rounds again, but each round renames all 100 columns with a
      # single JVM-side withColumnsRenamed call.
      a = datetime.datetime.now()
      for _ in range(6):
          raw = DataFrame(
              raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}),
              spark,
          )
      g = datetime.datetime.now()
      g - a
      # => datetime.timedelta(microseconds=632116)

      That is roughly a 20x speedup (0.63 s vs. 12.5 s) for the same 600 renames.
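
      For reference, since the fix landed in 3.4.0, the same rename is a one-liner through the public PySpark API, with no `_jdf` access needed:

      # Public API shipped with the fix (Spark 3.4.0+).
      raw = raw.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns})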


People

    Assignee: Santosh Pingale
    Reporter: Santosh Pingale
    Votes: 0
    Watchers: 3
