Details
Description
I have two dataframes which I am joining. small and big size dataframess. The optimizer suggest to use BroadcastNestedLoopJoin.
number of partitions for the big dataframe is 200 while small dataframe has 5 partitions.
The joined dataframe results with 205 partitions (joined.rdd.partitions.size), I have tried to understand why is this number and figured out that BroadCastNestedLoopJoin is actually a union.
code :
case class BroadcastNestedLoopJoin{
def doExecuteo(): =
}
can someone explain what exactly the code of doExecute() do? can you elaborate about all the null checks and why can we have nulls ? Why do we have 205 partions? link to a JIRA with discussion that can explain the code can help.