Spark / SPARK-18944

Understanding BroadcastNestedLoopJoin and number of partitions

Details

    • Type: Question
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Invalid
    • Affects Version/s: 1.6.2, 2.0.2
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: Spark 1.6.2

    Description

      I am joining two DataFrames, one small and one big. The optimizer chooses BroadcastNestedLoopJoin.
      The big DataFrame has 200 partitions and the small DataFrame has 5 partitions.
      The joined DataFrame ends up with 205 partitions (joined.rdd.partitions.size). I tried to understand where this number comes from and found that BroadcastNestedLoopJoin actually ends with a union.

      code:
      case class BroadcastNestedLoopJoin(...) {
        def doExecute(): RDD[InternalRow] = {
          ...
          ...
          sparkContext.union(
            matchedStreamRows,
            sparkContext.makeRDD(notMatchedBroadcastRows))
        }
      }

      Can someone explain what exactly the code of doExecute() does? Can you elaborate on all the null checks and why nulls can occur? Why do we get 205 partitions? A link to a JIRA discussion that explains the code would help.
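
      The partition arithmetic can be checked in isolation. The sketch below is not the operator's code: the object name UnionPartitionCount, the helper unionPartitions, and the sample data are made up for illustration, and it assumes spark-core is on the classpath. It only shows that SparkContext.union over two RDDs without a partitioner yields an RDD whose partition count is the sum of the inputs. Note that makeRDD without an explicit numSlices falls back to the context's default parallelism, so the extra 5 in the join may come from that setting rather than from the small DataFrame's own partitioning.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionPartitionCount {
  // Hypothetical helper: builds two RDDs with the requested partition counts
  // and returns the partition count of their union.
  def unionPartitions(sc: SparkContext, streamedParts: Int, extraParts: Int): Int = {
    // Stand-in for matchedStreamRows (the streamed side keeps its partitioning).
    val streamed = sc.parallelize(1 to 1000, streamedParts)
    // Stand-in for makeRDD(notMatchedBroadcastRows); numSlices is passed
    // explicitly here, whereas the snippet above relies on the default.
    val extra = sc.makeRDD(Seq(-1, -2, -3), extraParts)
    sc.union(streamed, extra).partitions.length
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("union-partition-count")
    val sc = new SparkContext(conf)
    try {
      val n = unionPartitions(sc, 200, 5)
      // Neither input has a partitioner, so the union simply concatenates partitions.
      require(n == 205, s"expected 205 partitions, got $n")
      println(s"partitions after union: $n")
    } finally {
      sc.stop()
    }
  }
}
```

      If the combined partition count matters downstream, calling coalesce or repartition on the joined result is one way to bring it back to a chosen number.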

          People

            Assignee: Unassigned
            Reporter: David Hodeffi