Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.1.1
- Fix Version/s: None
- Component/s: None
Description
From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum size in bytes for a table that will be broadcasted to all worker nodes when performing a join.
https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration
In fact, Spark compares plan.statistics.sizeInBytes of the Project (i.e. only the columns selected in the join), not the size of the relation itself.
A join may select only a few columns, so sizeInBytes can fall below autoBroadcastJoinThreshold while the broadcasted table itself is huge; the table is loaded entirely into the driver's memory, which can lead to an OOM.
Comparing spark.sql.autoBroadcastJoinThreshold against the projection rather than the broadcasted table size seems like quite a risky feature: more relations get broadcasted, but the chance of an OOM on the driver grows as well. A sketch of the effect is shown below.
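The following is a minimal sketch of the behavior described above, not a definitive reproduction: it assumes a local SparkSession, and the data shape (a synthetic payload column) is made up for illustration. With the default size-only estimation, a Project's estimate scales the child's size by the ratio of row widths, which is why dropping the payload column shrinks the estimate dramatically.

{code:scala}
// Shows that the optimizer's size estimate follows the projected columns,
// not the full relation. Assumes a plain local SparkSession; the data
// shape is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sizeInBytes-demo")
  .getOrCreate()

// A "wide" relation: one small key column plus a large payload column.
val wide = spark.range(1000000L).selectExpr("id", "repeat('x', 1000) AS payload")

// Estimate for the full relation: dominated by `payload`.
println(wide.queryExecution.optimizedPlan.stats.sizeInBytes)

// Estimate for a Project keeping only `id`: far smaller. This is the number
// compared against autoBroadcastJoinThreshold when a join only needs `id`,
// even though the underlying table is huge.
println(wide.select("id").queryExecution.optimizedPlan.stats.sizeInBytes)
{code}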
The workaround is to disable spark.sql.autoBroadcastJoinThreshold and put broadcast hints on relations that are known to be small, but in that case autoBroadcastJoinThreshold becomes useless. It would be more useful for autoBroadcastJoinThreshold to be compared against the relation size, so that memory usage on the driver stays predictable. The workaround looks like the sketch below.
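A sketch of that workaround, assuming the SparkSession from above; the table and column names ("facts", "small_dims", "dim_id") are hypothetical:

{code:scala}
// Turn off size-based auto-broadcast entirely and broadcast only relations
// that are known to be small.
import org.apache.spark.sql.functions.broadcast

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

val facts = spark.table("facts")      // large; must never be broadcast
val dims  = spark.table("small_dims") // known to fit in driver memory

// The explicit hint broadcasts `dims` regardless of plan statistics.
val joined = facts.join(broadcast(dims), Seq("dim_id"))
{code}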
Original task and test, from when auto-broadcast was compared against the relation's totalSize:
https://issues.apache.org/jira/browse/SPARK-2393
Task and PR where autoBroadcastJoinThreshold started being compared against the Project of the plan instead of the relation:
https://issues.apache.org/jira/browse/SPARK-13329
https://github.com/apache/spark/pull/11210
Related question on Stack Overflow: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s
Issue Links
- is related to:
  - SPARK-2393 Simple cost estimation and auto selection of broadcast join (Resolved)
  - SPARK-39667 Add another workaround when there is not enough memory to build and broadcast the table (Resolved)