[SPARK-11231] join returns schema with duplicated and ambiguous join columns - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.5.1
Fix Version/s: None
Component/s: SparkR
Labels:
- bulk-closed
Environment:

R

Description

In the case where the key column of two data frames are named the same thing, join returns a data frame where that column is duplicated. Since the content of the columns is guaranteed to be the same by row consolidating the identical columns into a single column would replicate standard R behavior[1] and help prevent ambiguous names.

Example:

> df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> sdf1 <- createDataFrame(sqlContext, df1)
> sdf2 <- createDataFrame(sqlContext, df2)
> sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> schema(sjdf)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value1", type = "DoubleType", nullable = TRUE
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value2", type = "DoubleType", nullable = TRUE

The duplicated key columns cause things like:

> library(magrittr)
> sjdf %>% select("key")
15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could be: key#125, key#127.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
	at org.apache.spark.sql.catalyst.tree

[1] In base R there is no"join", but a similar function "merge" is provided in which a "by" argument identifies the shared key column in the two data frames. In the case where the key column names differ "by.x" and "by.y" arguments can be used. In the case of same-named key columns the consolidation behavior requested above is observed. In the case of differing names they "by.x" name is retained and consolidated with the "by.y" column which is dropped.

> df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> merge(df1, df2, by="key")
  key value1 value2
1   A      1      4
2   B      2      5
3   C      3      6

df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> merge(df2, df3, by.x="key", by.y="akey")
  key value2 value1
1   A      4      1
2   B      5      2
3   C      6      3
> merge(df3, df2, by.x="akey", by.y="key")
  akey value1 value2
1    A      1      4
2    B      2      5
3    C      3      6

Attachments

Issue Links

duplicates

SPARK-10979 SparkR: Add merge to DataFrame

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Matt Pollock

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 21/Oct/15 15:06

Updated:: 21/May/19 04:34

Resolved:: 21/May/19 04:34