Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.1
Fix Version/s: None
Environment: Hive table with Spark 2.3.1 on Azure, using Azure storage as storage layer
Description
I'm working on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a corrupt remote block error.
First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, i.e. 10485760 bytes.
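For context, this is roughly how the setting is applied; the session-builder details below are illustrative placeholders, not our exact job configuration:

import org.apache.spark.sql.SparkSession

// Illustrative session setup; the app name is a placeholder.
val spark = SparkSession.builder()
  .appName("broadcast-threshold-repro")
  .enableHiveSupport()
  .getOrCreate()

// 10485760 bytes == 10MB; tables whose estimated size falls below this
// threshold are eligible for automatic broadcast in joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")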
Then we ran the query. In the SQL plan, we found that a table about 25MB in size was broadcast as well.
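As a quick way to see what the planner chose, something like the sketch below can be used; the second table and the join column are placeholders, not the real query:

// Hypothetical join shaped like the failing query; default.product is the
// table that ends up broadcast, the other names are placeholders.
val joined = spark.table("default.product")
  .join(spark.table("default.sales"), Seq("product_id"))

// Prints the physical plan; a BroadcastHashJoin / BroadcastExchange node
// shows that the smaller side was selected for broadcast.
joined.explain()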
In desc extended the table is also reported as 24452111 bytes. It is a Hive table. We always run into the error when this table is broadcast. Below is a sample error:
Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
I've also attached the physical plan in case you're interested. One thing to note: if I turn spark.sql.autoBroadcastJoinThreshold down to 5MB, this query executes successfully and default.product is NOT broadcast.
However, when I switch to another query that selects even fewer columns than the previous one, this table is still broadcast even at the 5MB threshold and fails with the same error. I even lowered the threshold to 1MB, with the same result.
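If it helps anyone hitting the same failure, one way to rule out the broadcast path entirely is to disable auto broadcast; the snippet below is only a sketch of that workaround and does not address the underlying size estimation:

// Setting the threshold to -1 disables automatic broadcast joins altogether,
// forcing a shuffle-based join (e.g. sort-merge) regardless of the estimated
// table size.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")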