Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
2.2.0
None
None
Description
Currently, withColumn claims to do the following: "adding a column or replacing the existing column that has the same name."
Unfortunately, if multiple existing columns have the same name (which is a normal occurrence after a join), every one of them is replaced and retained. The result contains multiple columns with the same name and the same value, along with messages about an ambiguous column.
The current implementation of withColumn contains this:
def withColumn(colName: String, col: Column): DataFrame = {
  val resolver = sparkSession.sessionState.analyzer.resolver
  val output = queryExecution.analyzed.output
  val shouldReplace = output.exists(f => resolver(f.name, colName))
  if (shouldReplace) {
    val columns = output.map { field =>
      if (resolver(field.name, colName)) {
        col.as(colName)
      } else {
        Column(field)
      }
    }
    select(columns : _*)
  } else {
    select(Column("*"), col.as(colName))
  }
}
Instead, I suggest something like this, which replaces all matching fields with a single instance of the new one:
def withColumn(colName: String, col: Column): DataFrame = {
  val resolver = sparkSession.sessionState.analyzer.resolver
  val output = queryExecution.analyzed.output
  val existing = output.filterNot(f => resolver(f.name, colName)).map(new Column(_))
  select(existing :+ col.as(colName): _*)
}
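To make the difference between the two versions concrete, here is a minimal sketch in plain Scala that does not depend on Spark. It models a DataFrame's output as a sequence of column names and the resolver as a case-insensitive comparison. Both of these are simplifying assumptions, and `WithColumnModel`, `currentWithColumn`, `suggestedWithColumn`, and the `new(...)` marker are hypothetical names used only for illustration.

```scala
// Simplified model, not Spark code: a schema is a sequence of column names.
object WithColumnModel {
  // Assumption: name resolution is case-insensitive.
  val resolver: (String, String) => Boolean = (a, b) => a.equalsIgnoreCase(b)

  // Mirrors the current logic: every matching field is replaced in place,
  // so N matching columns yield N copies of the new column.
  def currentWithColumn(output: Seq[String], colName: String): Seq[String] = {
    val shouldReplace = output.exists(f => resolver(f, colName))
    if (shouldReplace) {
      output.map(f => if (resolver(f, colName)) s"new($colName)" else f)
    } else {
      output :+ s"new($colName)"
    }
  }

  // Mirrors the suggested logic: drop all matching fields, then append
  // a single instance of the new column at the end.
  def suggestedWithColumn(output: Seq[String], colName: String): Seq[String] =
    output.filterNot(f => resolver(f, colName)) :+ s"new($colName)"
}
```

For a joined schema such as `Seq("id", "value", "id", "value")`, `currentWithColumn(..., "value")` yields `Seq("id", "new(value)", "id", "new(value)")`, while `suggestedWithColumn(..., "value")` yields `Seq("id", "id", "new(value)")`. Note that the suggested version always places the new column last, so a replaced column no longer keeps its original position.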