Spark / SPARK-16464

withColumn() allows illegal creation of duplicate column names on DataFrame

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: SparkR, SQL
    • Labels: None
    • Environment: Databricks.com

    Description

      Given an existing DataFrame, withColumn() lets me add a column whose name already exists, which produces a DataFrame with duplicate column names. I assume this should be illegal and that withColumn() should reject it. Some functions subsequently fail because of the duplicate column names. Example:

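      # Build a SparkR DataFrame from the built-in mtcars data set
      sdfCar <- createDataFrame(sqlContext, mtcars)
      sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg <= 20)
      # The second call reuses the name "isEfficient" and is accepted
      sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg, 1, 0))

      # Selecting the duplicated name is where the failure shows up
      sdfCar2 <- subset(sdfCar1, select = sdfCar1$isEfficient)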

      The subset() call fails with the message: "Reference 'isEfficient' is ambiguous"
      Note: I have only tested this in SparkR; it might also affect the other language APIs.
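
Until withColumn() rejects an existing name, one possible workaround is to check the column names yourself before adding a column. The sketch below is plain base R and does not call Spark. Both helper names, duplicatedNames() and assertNewColumnName(), are hypothetical and not part of SparkR. The `cols` vector is a shortened stand-in for what columns(sdfCar1) would be expected to return after the two withColumn() calls above.

```r
# Hypothetical helper (not part of SparkR): names that occur more than once.
duplicatedNames <- function(colNames) {
  unique(colNames[duplicated(colNames)])
}

# Hypothetical guard: stop before adding a name that already exists.
assertNewColumnName <- function(existingNames, newName) {
  if (newName %in% existingNames) {
    stop(sprintf("Column '%s' already exists", newName))
  }
  invisible(TRUE)
}

# Assumed stand-in for columns(sdfCar1), abbreviated to a few mtcars columns.
cols <- c("mpg", "cyl", "disp", "isEfficient", "isEfficient")

print(duplicatedNames(cols))
# [1] "isEfficient"
```

In a SparkR session the guard would be called as assertNewColumnName(columns(sdfCar), "isEfficient") before each withColumn() call, so the second call in the example above would stop with an error instead of silently creating the duplicate.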

            People

              Assignee: Unassigned
              Reporter: Neil Dewar
              Votes: 0
              Watchers: 4