Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-11231

join returns schema with duplicated and ambiguous join columns

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 1.5.1
    • None
    • SparkR
    • R

    Description

      In the case where the key column of two data frames are named the same thing, join returns a data frame where that column is duplicated. Since the content of the columns is guaranteed to be the same by row consolidating the identical columns into a single column would replicate standard R behavior[1] and help prevent ambiguous names.

      Example:

      > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
      > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
      > sdf1 <- createDataFrame(sqlContext, df1)
      > sdf2 <- createDataFrame(sqlContext, df2)
      > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
      > schema(sjdf)
      StructType
      |-name = "key", type = "StringType", nullable = TRUE
      |-name = "value1", type = "DoubleType", nullable = TRUE
      |-name = "key", type = "StringType", nullable = TRUE
      |-name = "value2", type = "DoubleType", nullable = TRUE
      

      The duplicated key columns cause things like:

      > library(magrittr)
      > sjdf %>% select("key")
      15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
        org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could be: key#125, key#127.;
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
      	at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
      	at org.apache.spark.sql.catalyst.tree
      

      [1] In base R there is no"join", but a similar function "merge" is provided in which a "by" argument identifies the shared key column in the two data frames. In the case where the key column names differ "by.x" and "by.y" arguments can be used. In the case of same-named key columns the consolidation behavior requested above is observed. In the case of differing names they "by.x" name is retained and consolidated with the "by.y" column which is dropped.

      > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
      > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
      > merge(df1, df2, by="key")
        key value1 value2
      1   A      1      4
      2   B      2      5
      3   C      3      6
      
      df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
      > merge(df2, df3, by.x="key", by.y="akey")
        key value2 value1
      1   A      4      1
      2   B      5      2
      3   C      6      3
      > merge(df3, df2, by.x="akey", by.y="key")
        akey value1 value2
      1    A      1      4
      2    B      2      5
      3    C      3      6
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              mpollock Matt Pollock
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: