[SPARK-17859] persist should not impede with spark's ability to perform a broadcast join. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: Optimizer, SQL
Labels:
- bulk-closed
Environment:

spark 2.0.0 , Linux RedHat

Description

I am using Spark 2.0.0

My investigation leads me to conclude that calling persist could prevent broadcast join from happening .

Example
Case1: No persist call
var df1 =spark.range(1000000).select($"id".as("id1"))
df1: org.apache.spark.sql.DataFrame = [id1: bigint]

var df2 =spark.range(1000).select($"id".as("id2"))
df2: org.apache.spark.sql.DataFrame = [id2: bigint]

df1.join(df2 , $"id1" === $"id2" ).explain
== Physical Plan ==
*BroadcastHashJoin id1#117L, id2#123L, Inner, BuildRight
:- *Project id#114L AS id1#117L
: +- *Range (0, 1000000, splits=2)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
+- *Project id#120L AS id2#123L
+- *Range (0, 1000, splits=2)

Case 2: persist call

df1.persist.join(df2 , $"id1" === $"id2" ).explain
16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data.
== Physical Plan ==
*SortMergeJoin id1#3L, id2#9L, Inner
:- *Sort id1#3L ASC, false, 0
: +- Exchange hashpartitioning(id1#3L, 10)
: +- InMemoryTableScan id1#3L
: : +- InMemoryRelation id1#3L, true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: : : +- *Project id#0L AS id1#3L
: : : +- *Range (0, 1000000, splits=2)
+- *Sort id2#9L ASC, false, 0
+- Exchange hashpartitioning(id2#9L, 10)
+- InMemoryTableScan id2#9L
: +- InMemoryRelation id2#9L, true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: : +- *Project id#6L AS id2#9L
: : +- *Range (0, 1000, splits=2)

Why does the persist call prevent the broadcast join .

My opinion is that it should not .

I was made aware that the persist call is lazy and that might have something to do with it , but I still contend that it should not .

Losing broadcast joins is really costly.

Attachments

Issue Links

is duplicated by

SPARK-21795 Broadcast hint ignored when dataframe is cached

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Franck Tago

Votes:: 2 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 10/Oct/16 22:53

Updated:: 25/May/21 01:53

Resolved:: 25/May/21 01:45