[SPARK-18944] Understanding BroadcastNestedLoopJoin and number of partitions - ASF JIRA

Details

Type: Question
Status: Resolved
Priority: Trivial
Resolution: Invalid
Affects Version/s: 1.6.2, 2.0.2
Fix Version/s: None
Component/s: SQL
Labels:
- question
Environment:

Spark 1.6.2

Description

I am joining two dataframes, one small and one big. The optimizer chooses BroadcastNestedLoopJoin.
The big dataframe has 200 partitions, while the small dataframe has 5 partitions.
The joined dataframe ends up with 205 partitions (joined.rdd.partitions.size). I tried to understand where this number comes from and figured out that BroadcastNestedLoopJoin actually ends with a union.

Code:
case class BroadcastNestedLoopJoin(...) {
  def doExecute() = {
    ...
    ...
    sparkContext.union(
      matchedStreamRows,
      sparkContext.makeRDD(notMatchedBroadcastRows))
  }
}

Can someone explain what exactly the code of doExecute() does? Can you elaborate on all the null checks and why nulls can occur there? Why do we get 205 partitions? A link to a JIRA discussion that explains the code would help.
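
To make the partition arithmetic concrete, here is a minimal sketch of how a plain RDD union adds up partition counts. It is not the actual doExecute() code: the object name UnionPartitionsSketch, the helper unionPartitionCount, and the sample data are hypothetical, and the two RDDs are only stand-ins for matchedStreamRows and makeRDD(notMatchedBroadcastRows). Note that makeRDD called without a slice count uses the context's default parallelism, so the extra 5 partitions may come from that setting rather than from the small dataframe itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of Spark.
object UnionPartitionsSketch {

  // Builds two RDDs with the given partition counts, unions them,
  // and returns the partition count of the result.
  def unionPartitionCount(streamParts: Int, leftoverParts: Int): Int = {
    val conf = new SparkConf()
      .setAppName("union-partitions-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for matchedStreamRows (the streamed, big side).
      val streamed = sc.parallelize(1 to 1000, numSlices = streamParts)
      // Stand-in for makeRDD(notMatchedBroadcastRows); the slice count is
      // passed explicitly here, whereas the snippet above relies on the default.
      val leftover = sc.makeRDD(Seq(-1, -2, -3), numSlices = leftoverParts)
      // Neither input has a partitioner, so the union simply keeps
      // every partition of both inputs.
      sc.union(streamed, leftover).partitions.length
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(unionPartitionCount(200, 5)) // prints 205
  }
}
```

If the extra partitions are a problem downstream, calling coalesce or repartition on the joined dataframe is one way to bring the count back down.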

People

Assignee: Unassigned

Reporter: David Hodeffi

Votes: 0

Watchers: 1

Dates

Created: 20/Dec/16 12:48

Updated: 20/Dec/16 13:03

Resolved: 20/Dec/16 13:03