  Spark / SPARK-11282

Very strange broadcast join behaviour


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 1.5.1
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Hi,
      I found some very strange broadcast join behaviour.

      Following SPARK-10577 (https://issues.apache.org/jira/browse/SPARK-10577),
      I'm using the hint for broadcast joins. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files.)
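
      After that patch, the hint is invoked via the broadcast() function in
      pyspark.sql.functions. A minimal illustration (df_big and df_small are
      hypothetical DataFrames, not the tables from the attached script):

      from pyspark.sql.functions import broadcast
      joined = df_big.join(broadcast(df_small), df_big.id == df_small.id2, "left_outer")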

      I found that whether this feature works depends on the executor memory.
      In my case the broadcast join only works correctly up to 31G of executor memory.

      Example:

      spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true
      Creating test tables...
      Joining tables...
      Joined table schema:
      root
       |-- id: long (nullable = true)
       |-- val: long (nullable = true)
       |-- id2: long (nullable = true)
       |-- val2: long (nullable = true)
      
      Selecting data for id = 5...
      [Row(id=5, val=5, id2=5, val2=5)]
      spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true
      Creating test tables...
      Joining tables...
      Joined table schema:
      root
       |-- id: long (nullable = true)
       |-- val: long (nullable = true)
       |-- id2: long (nullable = true)
       |-- val2: long (nullable = true)
      
      Selecting data for id = 5...
      [Row(id=5, val=5, id2=None, val2=None)]
      

      Please find example code attached.
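
      The attachment is not reproduced here; below is a rough, hypothetical
      sketch of what a debug_broadcast_join.py reproduction script could look
      like, inferred from the output above (table sizes and the generated data
      are guesses, not the actual attached code):

      import sys
      from pyspark import SparkContext
      from pyspark.sql import SQLContext
      from pyspark.sql.functions import broadcast  # hint from SPARK-10577 / PR #8801

      sc = SparkContext(appName="debug_broadcast_join")
      sqlContext = SQLContext(sc)
      use_hint = len(sys.argv) > 1 and sys.argv[1] == "true"

      print("Creating test tables...")
      # Two test tables keyed on id; the sizes here are illustrative only.
      big = sqlContext.createDataFrame([(i, i) for i in range(100000)], ["id", "val"])
      small = sqlContext.createDataFrame([(i, i) for i in range(100)], ["id2", "val2"])

      print("Joining tables...")
      # Apply the broadcast hint to the smaller table when requested on the command line.
      joined = big.join(broadcast(small) if use_hint else small,
                        big.id == small.id2, "left_outer")

      print("Joined table schema:")
      joined.printSchema()

      print("Selecting data for id = 5...")
      print(joined.where(joined.id == 5).collect())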

      Attachments

        1. SPARK-11282.py
          1 kB
          Maciej Bryński


            People

              Assignee: Unassigned
              Reporter: Maciej Bryński (maver1ck)
              Votes: 0
              Watchers: 0
