Description
Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a single command. Users currently hit several issues when they rename columns one at a time with repeated `withColumnRenamed` calls:
- Even when it works, performance is noticeably slower than a single-pass rename.
- In some cases, a `StackOverflowError` is raised because the logical plan grows too large (the plan growth is sketched right after this list).
- In a few cases, the driver died due to memory consumption.
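To make the plan growth concrete: every `withColumnRenamed` call wraps the previous plan in one more `Project` node, so the analyzed plan gets one level deeper per rename. A minimal illustration (this two-rename chain is ours, not part of the original report):

```python
df = spark.range(1)

# Each withColumnRenamed wraps the plan in another Project node.
df = df.withColumnRenamed("id", "a").withColumnRenamed("a", "b")

# The analyzed plan shows one nested Project per rename; with hundreds of
# renames, this nesting is what eventually overflows the stack.
df.explain(mode="extended")
```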
Some reproducible benchmarks:
Iterative `withColumnRenamed` (100 columns, six full rename passes):

```python
import datetime

import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100

data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
for col in raw.columns:
    raw = raw.withColumnRenamed(col, f"prefix_{col}")
b = datetime.datetime.now()
for col in raw.columns:
    raw = raw.withColumnRenamed(col, f"prefix_{col}")
c = datetime.datetime.now()
for col in raw.columns:
    raw = raw.withColumnRenamed(col, f"prefix_{col}")
d = datetime.datetime.now()
for col in raw.columns:
    raw = raw.withColumnRenamed(col, f"prefix_{col}")
e = datetime.datetime.now()
for col in raw.columns:
    raw = raw.withColumnRenamed(col, f"prefix_{col}")
f = datetime.datetime.now()
for col in raw.columns:
    raw = raw.withColumnRenamed(col, f"prefix_{col}")
g = datetime.datetime.now()

g - a
# datetime.timedelta(seconds=12, microseconds=480021)
```
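For comparison, the same rename can already be expressed as a single projection with `select` plus `alias`, which sidesteps the nested-`Project` problem. This workaround is our sketch, not part of the original benchmark:

```python
from pyspark.sql import functions as F

# A single select with aliases renames every column in one Project node.
renamed = raw.select([F.col(c).alias(f"prefix_{c}") for c in raw.columns])
```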
The same workload, but each pass is a single `withColumnsRenamed` call through the JVM-side API, which is not yet exposed in PySpark:

```python
import datetime

import numpy as np
import pandas as pd
from pyspark.sql import DataFrame

num_rows = 2
num_columns = 100

data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
b = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
c = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
d = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
e = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
f = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
g = datetime.datetime.now()

g - a
# datetime.timedelta(microseconds=632116)
```
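Based on the JVM method exercised above, the proposed user-facing API would plausibly take a map from old names to new names. A hypothetical sketch of the PySpark call (the exact name and signature are what this issue proposes to add):

```python
# Hypothetical public API mirroring the _jdf.withColumnsRenamed call above.
renamed = raw.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns})
```

A dict-based signature keeps the rename a single logical-plan transformation no matter how many columns are renamed, matching the fast path measured in the second benchmark.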