[SPARK-16464] withColumn() allows illegal creation of duplicate column names on DataFrame - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: 1.6.1
Fix Version/s: None
Component/s: SparkR, SQL
Labels: None
Environment: Databricks.com
Description

Given an existing DataFrame, withColumn() lets me add a column whose name duplicates that of an existing column. I assume this should be illegal and that withColumn() should reject it. Some functions subsequently fail because of the duplicate column names. Example:

sdfCar <- createDataFrame(sqlContext, mtcars)
sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))

sdfCar2 <- subset(sdfCar1, select=sdfCar1$isEfficient)

The subset() call fails with the message: "Reference 'isEfficient' is ambiguous"
Note: I have only confirmed this in SparkR; it might affect other language APIs as well.