[SPARK-11757] Incorrect join output for joining two dataframes loaded from Parquet format - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: 2.0.0
Component/s: PySpark, SQL
Labels:
- dataframe
- emr
- join
- pyspark
Environment:

Python 2.7, Spark 1.5.0, Amazon linux ami https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/

Description

Reading two DataFrames from Parquet files in S3 and joining them returns incorrect output (0 rows) when the join is invoked by column name. The join works correctly if an explicit join condition is used instead:

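from pyspark.sql import SQLContext

# sc is the SparkContext (e.g. as provided by the pyspark shell)
sqlContext = SQLContext(sc)
a = sqlContext.read.parquet('s3://path-to-data-a/')
b = sqlContext.read.parquet('s3://path-to-data-b/')

# join by column name: incorrectly returns 0 rows
c = a.join(b, on='id', how='left_outer')
c.count()

# join by explicit condition: correct output
d = a.join(b, a['id'] == b['id'], how='left_outer')
d.count()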
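
A count of 0 is a clear sign of a bug here, because a left outer join keeps every row of the left side, so its result can never have fewer rows than `a` (assuming `a` is non-empty). The following is a minimal pure-Python sketch of that expected behaviour, not Spark's implementation; the helper `left_outer_join` and the sample rows are hypothetical and only stand in for the Parquet data.

```python
def left_outer_join(left, right, key):
    """Reference left outer join over lists of dicts.

    Illustrative only: this is not how Spark implements joins.
    """
    # Columns that exist only on the right side; set to None when unmatched.
    right_cols = {col for row in right for col in row if col != key}

    # Index right-side rows by join key for lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for lrow in left:
        matches = index.get(lrow[key], [])
        if matches:
            for rrow in matches:
                merged = dict(rrow)
                merged.update(lrow)  # left values win on shared column names
                joined.append(merged)
        else:
            # Unmatched left rows are preserved, with right columns set to None.
            merged = dict.fromkeys(right_cols)
            merged.update(lrow)
            joined.append(merged)
    return joined


# Hypothetical sample data standing in for the Parquet inputs.
a_rows = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}, {"id": 3, "x": "a3"}]
b_rows = [{"id": 1, "y": "b1"}, {"id": 1, "y": "b1-dup"}]

result = left_outer_join(a_rows, b_rows, "id")
print(len(result))  # 4: two matches for id 1, plus unmatched ids 2 and 3
```

Under the same reasoning, comparing `c.count()` against `a.count()` is a quick sanity check for the column-name join in the report above.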

Issue Links

relates to: SPARK-13427 Support USING clause in JOIN (Resolved)

People

Assignee: Dilip Biswal

Reporter: Petri Kärkäs

Votes: 0

Watchers: 2

Dates

Created: 16/Nov/15 12:38

Updated: 28/Apr/16 15:09

Resolved: 27/Apr/16 22:38