Spark / SPARK-12358

Spark SQL query with lots of small tables under broadcast threshold leading to java.lang.OutOfMemoryError


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Hi

      I have a Spark SQL query that joins many small tables (five or more), all below the broadcast threshold. Looking at the query plan, Spark broadcasts all of these tables together without checking whether sufficient memory is available. This leads to

      Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space

      errors, which cause the executors to die and the query to fail.

      I got around this issue by reducing spark.sql.autoBroadcastJoinThreshold so that the bigger tables in the query are no longer broadcast.

      A fix would be to ensure that, in addition to the per-table threshold (spark.sql.autoBroadcastJoinThreshold), there is a cumulative per-query broadcast threshold (say spark.sql.autoBroadcastJoinThresholdCumulative), so that only data up to that limit is broadcast, preventing executors from running out of memory.
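      For reference, the workaround can be applied as a configuration setting when starting the session. This is a sketch, not a recommendation: the 1 MB value below is illustrative, and the right value depends on the executor heap size and the number of small tables in the query (the default in Spark 1.5 is 10 MB, i.e. 10485760 bytes).

      ```shell
      # Lower the per-table broadcast threshold (in bytes) so that the larger
      # of the "small" tables fall back to a shuffle join instead of being
      # broadcast. 1048576 (1 MB) is an illustrative value.
      spark-shell --conf spark.sql.autoBroadcastJoinThreshold=1048576
      ```

      The same property can be set from SQL with `SET spark.sql.autoBroadcastJoinThreshold=1048576;`, and setting it to -1 disables automatic broadcast joins entirely.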

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Deenar Toraskar (deenar)
            Votes: 0
            Watchers: 3
