  Spark / SPARK-11282

Very strange broadcast join behaviour


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 1.5.1
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Hi,
      I found some very strange broadcast join behaviour.

      Following SPARK-10577 (https://issues.apache.org/jira/browse/SPARK-10577),
      I'm using the hint for broadcast joins. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files.)
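
      After that patch, the hint is invoked via the broadcast() function in
      pyspark.sql.functions. A minimal illustration (df_big and df_small are
      hypothetical DataFrames, not the tables from the attached script):

      from pyspark.sql.functions import broadcast
      joined = df_big.join(broadcast(df_small), df_big.id == df_small.id2, "left_outer")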

      I found that whether this feature works depends on the executor memory.
      In my case the broadcast join only works correctly up to 31G of executor memory.

      Example:

      spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true
      Creating test tables...
      Joining tables...
      Joined table schema:
      root
       |-- id: long (nullable = true)
       |-- val: long (nullable = true)
       |-- id2: long (nullable = true)
       |-- val2: long (nullable = true)
      
      Selecting data for id = 5...
      [Row(id=5, val=5, id2=5, val2=5)]
      spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true
      Creating test tables...
      Joining tables...
      Joined table schema:
      root
       |-- id: long (nullable = true)
       |-- val: long (nullable = true)
       |-- id2: long (nullable = true)
       |-- val2: long (nullable = true)
      
      Selecting data for id = 5...
      [Row(id=5, val=5, id2=None, val2=None)]
      

      Please find example code attached.
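
      The attachment is not reproduced here; below is a rough, hypothetical
      sketch of what a debug_broadcast_join.py reproduction script could look
      like, inferred from the output above (table sizes and the generated data
      are guesses, not the actual attached code):

      import sys
      from pyspark import SparkContext
      from pyspark.sql import SQLContext
      from pyspark.sql.functions import broadcast  # hint from SPARK-10577 / PR #8801

      sc = SparkContext(appName="debug_broadcast_join")
      sqlContext = SQLContext(sc)
      use_hint = len(sys.argv) > 1 and sys.argv[1] == "true"

      print("Creating test tables...")
      # Two test tables keyed on id; the sizes here are illustrative only.
      big = sqlContext.createDataFrame([(i, i) for i in range(100000)], ["id", "val"])
      small = sqlContext.createDataFrame([(i, i) for i in range(100)], ["id2", "val2"])

      print("Joining tables...")
      # Apply the broadcast hint to the smaller table when requested on the command line.
      joined = big.join(broadcast(small) if use_hint else small,
                        big.id == small.id2, "left_outer")

      print("Joined table schema:")
      joined.printSchema()

      print("Selecting data for id = 5...")
      print(joined.where(joined.id == 5).collect())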

      Attachments

        1. SPARK-11282.py
          1 kB
          Maciej Bryński


            People

              Assignee: Unassigned
              Reporter: Maciej Bryński (maver1ck)
              Votes: 0
              Watchers: 0
