Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 1.6.1, 2.0.0
- Fix Version/s: None
- Environment: Mac OS X 10.11.4 and Ubuntu Linux 16.04 LTS
Description
I think I found a bug in the way columns are handled in (py)Spark.
How to reproduce
df = sc.parallelize([[1, 'A', 'Not B'], [1, 'Not A', 'B']]).toDF(['id', 'a', 'b'])
example = sc.parallelize([[1], [2]]).toDF(['id'])
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
Results in:
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
Expected result:
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
When using the aliases as string references in the select statement, it does work properly:
example.join(df_a, 'id').join(df_b, 'id').select('id', 'df_a.a', 'df_b.b').show()
This gives the expected result:
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
I'm not sure whether this is how you're supposed to select columns from this kind of DataFrame, but I think the first example should have worked just as well.
I did some other experiments with this:
It also works when creating a new DataFrame using toDF():
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.toDF(*df_a.columns)
df_b = df_b.toDF(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
This gives the expected result:
+---+---+---+
| id|  a|  b|
+---+---+---+
|  1|  A|  B|
+---+---+---+
But it does not work when doing the same with a select (which, according to the docs, should also return a new DataFrame):
df_a = df.filter('a = "A"').alias('df_a')
df_b = df.filter('b = "B"').alias('df_b')
df_a = df_a.select(*df_a.columns)
df_b = df_b.select(*df_b.columns)
example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
Results in:
+---+---+-----+
| id|  a|    b|
+---+---+-----+
|  1|  A|Not B|
+---+---+-----+
At least the documentation is unclear here, and this may also be a Column handling bug.
Attachments
Issue Links
- relates to: SPARK-20073 Unexpected Cartesian product when using eqNullSafe in join with a derived table (Resolved)