[SPARK-11757] Incorrect join output for joining two dataframes loaded from Parquet format - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: 2.0.0
Component/s: PySpark, SQL
Labels:
- dataframe
- emr
- join
- pyspark
Environment:

Python 2.7, Spark 1.5.0, Amazon Linux AMI (https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/)

Description

Reading DataFrames from Parquet files in S3 and joining them returns incorrect output when the join is invoked by column name. The join works correctly if an explicit join condition is used instead:

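from pyspark.sql import SQLContext

# sc is the SparkContext provided by the pyspark shell
sqlContext = SQLContext(sc)
a = sqlContext.read.parquet('s3://path-to-data-a/')
b = sqlContext.read.parquet('s3://path-to-data-b/')

# join by column name: returns 0 rows (incorrect)
c = a.join(b, on='id', how='left_outer')
c.count()

# join by explicit condition: correct output
d = a.join(b, a['id'] == b['id'], how='left_outer')
d.count()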
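
For reference, a left outer join keeps every row of the left side, so the 0-row result above cannot be correct whenever a is non-empty. The sketch below is a plain-Python model of that expectation, not Spark's implementation. The function name left_outer_join and the sample rows are hypothetical and are not taken from the reporter's data.

```python
def left_outer_join(left, right, key):
    """Reference left outer join over lists of dicts sharing one key column.

    Hypothetical helper for illustration only; it models the expected
    semantics, not how Spark executes the join.
    """
    # Right-side columns other than the key, used to pad unmatched left rows.
    right_cols = sorted({col for row in right for col in row if col != key})

    # Index right rows by key value for lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    result = []
    for lrow in left:
        k = lrow[key]
        # As in SQL, a null key never matches anything.
        matches = index.get(k, []) if k is not None else []
        if matches:
            for rrow in matches:
                merged = dict(lrow)
                merged.update({c: v for c, v in rrow.items() if c != key})
                result.append(merged)
        else:
            # Unmatched left row is kept, with right-side columns set to None.
            merged = dict(lrow)
            merged.update({c: None for c in right_cols})
            result.append(merged)
    return result


# Hypothetical sample rows standing in for the two Parquet inputs.
a_rows = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}, {"id": 3, "x": "a3"}]
b_rows = [{"id": 1, "y": "b1"}, {"id": 1, "y": "b1bis"}]

joined = left_outer_join(a_rows, b_rows, "id")
# Every left row survives, so the output has at least len(a_rows) rows.
assert len(joined) >= len(a_rows)
```

Under this model, both the column-name form and the explicit-condition form of the join in the description should produce the same row count, which is at least a.count().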