Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 1.6.1, 2.0.0
- Fix Version/s: None
- Environment: Mac OS X 10.11.4 and Ubuntu Linux 16.04 LTS
Description
I think I found a bug in the way columns are handled in (py)Spark.
How to reproduce
df = sc.parallelize([[1, 'A', 'Not B'], [1, 'Not A', 'B']]).toDF(['id', 'a', 'b'])
example = sc.parallelize([[1], [2]]).toDF(['id'])
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
Results in:
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
Expected result:
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
When using the aliases as string references in the select statement, it does work properly:
example.join(df_a, 'id').join(df_b, 'id').select('id', 'df_a.a', 'df_b.b').show()
This gives the expected result:
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
I'm not sure whether this is how you're supposed to select columns from this kind of DataFrame, but I think the first example should have worked just as well.
I did some other experiments with this:
It also works when creating a new DataFrame using toDF():
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.toDF(*df_a.columns)
df_b = df_b.toDF(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
This gives the expected result:
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
But it does not work when doing the same with a select (which, according to the docs, should also return a new DataFrame):
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.select(*df_a.columns)
df_b = df_b.select(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
Results in:
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
At least the documentation is unclear here, and this may also be a Column handling bug.
Attachments
Issue Links
- relates to: SPARK-20073 Unexpected Cartesian product when using eqNullSafe in join with a derived table (Resolved)