Description
We saw these exceptions during block push:
22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block shuffle_170_568_174, and will not retry (0 retries)
org.apache.spark.network.shuffle.BlockPushException: !application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration
	at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92)
	at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300)
	at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290)
	at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312)
	at org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168)
22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174 to BlockManagerId(, node-x, 7337, None) failed.
Note: The NodeManager on node-x (node against which this exception was seen) was not restarted.
This happened because the executor registers its block manager with BlockManagerMaster before it registers with the external shuffle service (ESS). In push-based shuffle, the driver selects block managers as mergers for the shuffle push. However, the ESS on a selected node can merge pushed blocks only after it has received the metadata about merged directories from a local executor, which is sent when that executor registers with the ESS. If this local executor registration is delayed but the ESS host has already been picked as a merger, the ESS fails to merge the blocks pushed to it, which is what happened here.
The local executor on node-x is executor 754, and its block manager registration happened at 13:28:11:
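The dependency described above can be illustrated with a minimal sketch. This is not Spark's RemoteBlockPushResolver; the class MergerSketch, its methods, and the ".data" file suffix are hypothetical names chosen for illustration. Only the error message is taken from the log above. It shows a merger that rejects every pushed block for an application until some local executor has reported its local dirs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the merge side of the shuffle service (not Spark code). */
final class MergerSketch {
    // appId -> local dirs reported by a local executor when it registers
    private final Map<String, List<String>> activeLocalDirs = new ConcurrentHashMap<>();

    /** Invoked when an executor on this node registers with the shuffle service. */
    void registerExecutor(String appId, List<String> localDirs) {
        activeLocalDirs.put(appId, new ArrayList<>(localDirs));
    }

    /** Invoked for each pushed block; fails if no local executor has registered yet. */
    String resolveMergedFile(String appId, String blockId) {
        List<String> dirs = activeLocalDirs.get(appId);
        if (dirs == null || dirs.isEmpty()) {
            throw new IllegalArgumentException(
                "Active local dirs list has not been updated by any executor registration");
        }
        // Illustrative only: pick a directory from the block id's hash.
        String dir = dirs.get(Math.floorMod(blockId.hashCode(), dirs.size()));
        return dir + blockId + ".data";
    }

    public static void main(String[] args) {
        MergerSketch ess = new MergerSketch();
        String app = "application_1653753500486_3193550";
        String block = "shuffle_170_568_174";
        try {
            // The driver already chose this node as a merger, but no executor registered here.
            ess.resolveMergedFile(app, block);
        } catch (IllegalArgumentException e) {
            System.out.println("push rejected: " + e.getMessage());
        }
        // After the delayed registration arrives, the same push can be resolved.
        ess.registerExecutor(app, Arrays.asList("/grid/i/tmp/yarn/", "/grid/g/tmp/yarn/"));
        System.out.println("push accepted: " + ess.resolveMergedFile(app, block));
    }
}
```

In this sketch the merger has no way to recover on its own: every push fails until the registration arrives, which matches the behavior seen on node-x.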
22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has registered (new total is 1200)
22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None)
The application was registered with the shuffle server on node-x at 13:29:40:
2022-06-24 13:29:40,343 INFO org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/, /grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/, /grid/c/tmp/yarn/] for application application_1653753500486_3193550
node-x was selected as a merger by the driver after 13:28:11, and when the executors started pushing to it, all of those pushes failed until 13:29:40.
We can fix this by having the executor register with the ESS before it registers the block manager with the BlockManagerMaster.
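For a rough sense of scale, the timestamps quoted above bound how long pushes to node-x could fail. The sketch below only does the arithmetic; the class name PushFailureWindow is hypothetical. It assumes the driver and NodeManager clocks are reasonably in sync and ignores the sub-second part of the NodeManager timestamp.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical helper: computes gaps between the log timestamps quoted in this report. */
final class PushFailureWindow {
    static Duration between(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
    }

    public static void main(String[] args) {
        // Block manager registration (13:28:11) to ESS registration (13:29:40).
        Duration upperBound = between("13:28:11", "13:29:40");
        // Logged push failure (13:29:14) to ESS registration (13:29:40).
        Duration afterObservedFailure = between("13:29:14", "13:29:40");
        System.out.println("upper bound (s): " + upperBound.getSeconds());                   // 89
        System.out.println("after observed failure (s): " + afterObservedFailure.getSeconds()); // 26
    }
}
```

The window was therefore at most about 89 seconds, and at least 26 seconds elapsed between the logged failure and the point where the ESS could accept pushes.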
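The effect of the proposed reordering can be sketched as follows. This is a simplified model, not the actual patch: StartupOrderSketch, the Step enum, and hasUnsafeWindow are hypothetical names, and the model assumes a node becomes eligible as a merger as soon as its block manager is registered with the BlockManagerMaster.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of executor startup ordering (not Spark code). */
final class StartupOrderSketch {
    enum Step { REGISTER_WITH_ESS, REGISTER_BLOCK_MANAGER }

    /**
     * Returns true if, after any startup step, the node is visible to the driver
     * (block manager registered) while the ESS still lacks the local dirs.
     */
    static boolean hasUnsafeWindow(List<Step> startupOrder) {
        boolean essHasLocalDirs = false;
        boolean visibleToDriver = false;
        for (Step step : startupOrder) {
            if (step == Step.REGISTER_WITH_ESS) {
                essHasLocalDirs = true;
            } else {
                visibleToDriver = true;
            }
            if (visibleToDriver && !essHasLocalDirs) {
                return true; // the driver may pick this node as a merger too early
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Step> current = Arrays.asList(Step.REGISTER_BLOCK_MANAGER, Step.REGISTER_WITH_ESS);
        List<Step> proposed = Arrays.asList(Step.REGISTER_WITH_ESS, Step.REGISTER_BLOCK_MANAGER);
        System.out.println("current order unsafe: " + hasUnsafeWindow(current));   // true
        System.out.println("proposed order unsafe: " + hasUnsafeWindow(proposed)); // false
    }
}
```

Under this model, registering with the ESS first means the node is never selectable as a merger before the ESS knows the merge directories.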