Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-46516

autoBroadcastJoinThreshold compared to project of a plan not a relation size

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.1
    • None
    • SQL
    • None

    Description

      From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum size in bytes for a table that will be broadcasted to all worker nodes when performing a join.

      https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration

      In fact Spark compares plan.statistics.sizeInBytes  of a project (columns selected in join), not a relation size.

      https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368

      Join can select only a few columns and sizeInBytes will be lesser than autoBroadcastJoinThreshold, but broadcasted table can be huge and it is loaded entirely into drivers memory which can lead to OOM.

      spark.sql.autoBroadcastJoinThreshold parameter compared to projection instead of broadcasted table size seems quite risky feature: there will be more broadcasted relations but more chances to get OOM on the driver too.

      The solution is to disable spark.sql.autoBroadcastJoinThreshold and set hints on really small relations, but in that case autoBroadcastJoinThreshold seems useless.  It would be more usefull to have autoBroadcastJoinThreshold which campres to relations size and have predicted memory usage on the driver.

       

      Original task and test when autobroadcast compared to relation totalSize:

      https://issues.apache.org/jira/browse/SPARK-2393

      https://github.com/apache/spark/pull/1238/files#diff-00485e6cae519f81adca5ceee66227c6eae35db709619d505468f8765175ac18R39

       

      Task and PR where autoBroadcastJoinThreshold started to be compared to project of a plan instead of relations:

      https://issues.apache.org/jira/browse/SPARK-13329

      https://github.com/apache/spark/pull/11210

       

      Related topic on SO: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              gsavinov Guram Savinov
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: