[SPARK-17211] Broadcast join produces incorrect results when compressed Oops differs between driver, executor - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.1, 2.1.0
Component/s: PySpark
Labels:
None

Description

Broadcast join produces incorrect columns in join result, see below for an example. The same join but without using broadcast gives the correct columns.

Running PySpark on YARN on Amazon EMR 5.0.0.

import pyspark.sql.functions as func

keys = [
    (54000000, 0),
    (54000001, 1),
    (54000002, 2),
]

keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
keys_df.show()
# +--------+-----+
# |  key_id|value|
# +--------+-----+
# |54000000|    0|
# |54000001|    1|
# |54000002|    2|
# +--------+-----+

data = [
    (54000002,    1),
    (54000000,    2),
    (54000001,    3),
]

data_df = spark.createDataFrame(data, ['key_id', 'foo'])
data_df.show()
# +--------+---+                                                                  
# |  key_id|foo|
# +--------+---+
# |54000002|  1|
# |54000000|  2|
# |54000001|  3|
# +--------+---+

### INCORRECT ###

data_df.join(func.broadcast(keys_df), 'key_id').show()
# +--------+---+--------+                                                         
# |  key_id|foo|   value|
# +--------+---+--------+
# |54000002|  1|54000002|
# |54000000|  2|54000000|
# |54000001|  3|54000001|
# +--------+---+--------+

### CORRECT ###

data_df.join(keys_df, 'key_id').show()
# +--------+---+-----+
# |  key_id|foo|value|
# +--------+---+-----+
# |54000000|  2|    0|
# |54000001|  3|    1|
# |54000002|  1|    2|
# +--------+---+-----+

Attachments

Issue Links

links to

[Github] Pull Request #14927 (davies)

Activity

People

Assignee:: Davies Liu

Reporter:: Jarno Seppanen

Votes:: 1 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 24/Aug/16 07:53

Updated:: 06/Sep/16 17:47

Resolved:: 06/Sep/16 17:47