  1. Spark
  2. SPARK-16464

withColumn() allows illegal creation of duplicate column names on DataFrame


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: SparkR, SQL
    • Labels: None
    • Environment: Databricks.com

    Description

      Starting from an existing DataFrame, withColumn() lets me add a column whose name duplicates one that already exists. I assume this should be illegal and that withColumn() should reject it. Some functions subsequently fail because of the duplicate column names. Example:

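      # Build a Spark DataFrame from the built-in mtcars data set
      sdfCar <- createDataFrame(sqlContext, mtcars)
      # Add a new column named "isEfficient"
      sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
      # Add a second column with the same name; no error is raised here
      sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))

      # Selecting the duplicated name is the step that fails
      sdfCar2 <- subset(sdfCar1, select=sdfCar1$isEfficient)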

      The subset() command fails with the message: "Reference 'isEfficient' is ambiguous"
      Note: I have only confirmed this in SparkR; it might also affect the other language APIs.
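
      Until withColumn() rejects duplicates itself, a caller-side guard is one possible workaround. The sketch below is only an illustration: checkNewColumnName() and safeWithColumn() are hypothetical helpers, not part of SparkR, and the case-insensitive comparison assumes Spark SQL's default, case-insensitive column resolution. The only SparkR calls used are columns() and withColumn().

```r
# Hypothetical helper (not part of SparkR): stops with an error if colName
# is already present in `existing`; otherwise returns TRUE invisibly.
# Assumption: names are compared case-insensitively.
checkNewColumnName <- function(existing, colName) {
  if (tolower(colName) %in% tolower(existing)) {
    stop(sprintf("Column '%s' already exists", colName), call. = FALSE)
  }
  invisible(TRUE)
}

# Hypothetical wrapper: validates the name against the DataFrame's current
# columns, then delegates to SparkR's withColumn().
safeWithColumn <- function(df, colName, col) {
  checkNewColumnName(SparkR::columns(df), colName)
  SparkR::withColumn(df, colName, col)
}

# Usage with the DataFrames from the example (requires a running SparkR session):
# sdfCar1 <- safeWithColumn(sdfCar, "isEfficient", sdfCar$mpg <= 20)    # new name, accepted
# sdfCar1 <- safeWithColumn(sdfCar1, "isEfficient", sdfCar1$mpg > 25)   # duplicate name, stops
```

      With a guard like this, the error surfaces at the point where the duplicate name is introduced rather than later, when subset() tries to resolve the ambiguous reference.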


          People

            Assignee: Unassigned
            Reporter: Neil Dewar (neil@dewar-us.com)
            Votes: 0
            Watchers: 4
