Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17859

persist should not impede with spark's ability to perform a broadcast join.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.0.0
    • None
    • Optimizer, SQL
    • spark 2.0.0 , Linux RedHat

    Description

      I am using Spark 2.0.0

      My investigation leads me to conclude that calling persist could prevent broadcast join from happening .

      Example
      Case1: No persist call
      var df1 =spark.range(1000000).select($"id".as("id1"))
      df1: org.apache.spark.sql.DataFrame = [id1: bigint]

      var df2 =spark.range(1000).select($"id".as("id2"))
      df2: org.apache.spark.sql.DataFrame = [id2: bigint]

      df1.join(df2 , $"id1" === $"id2" ).explain
      == Physical Plan ==
      *BroadcastHashJoin id1#117L, id2#123L, Inner, BuildRight
      :- *Project id#114L AS id1#117L
      : +- *Range (0, 1000000, splits=2)
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Project id#120L AS id2#123L
      +- *Range (0, 1000, splits=2)

      Case 2: persist call

      df1.persist.join(df2 , $"id1" === $"id2" ).explain
      16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data.
      == Physical Plan ==
      *SortMergeJoin id1#3L, id2#9L, Inner
      :- *Sort id1#3L ASC, false, 0
      : +- Exchange hashpartitioning(id1#3L, 10)
      : +- InMemoryTableScan id1#3L
      : : +- InMemoryRelation id1#3L, true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      : : : +- *Project id#0L AS id1#3L
      : : : +- *Range (0, 1000000, splits=2)
      +- *Sort id2#9L ASC, false, 0
      +- Exchange hashpartitioning(id2#9L, 10)
      +- InMemoryTableScan id2#9L
      : +- InMemoryRelation id2#9L, true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      : : +- *Project id#6L AS id2#9L
      : : +- *Range (0, 1000, splits=2)

      Why does the persist call prevent the broadcast join .

      My opinion is that it should not .

      I was made aware that the persist call is lazy and that might have something to do with it , but I still contend that it should not .

      Losing broadcast joins is really costly.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              tafranky@gmail.com Franck Tago
              Votes:
              2 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: