Details
Bug
Status: Resolved
Major
Resolution: Fixed
3.3.0
None
Description
When shuffle files are not found, the decommissioner handles IOException, but the exception actually thrown is a SparkException that wraps the IOException, as shown below:
22/08/10 18:05:34 ERROR BlockManagerDecommissioner: Error occurred during migrating migrate_shuffle_1_356
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4(BlockManagerDecommissioner.scala:120)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$4$adapted(BlockManagerDecommissioner.scala:111)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:111)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Failed to send RPC RPC 5697756267528635203 to /10.240.2.65:43481: java.io.FileNotFoundException: /tmp/blockmgr-98a2a29a-5231-4fed-a82e-6bc0531ad407/15/shuffle_1_356_0.index (No such file or directory)
    at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:392)
    at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:369)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
    at io.netty.channel.ChannelOutboundBuffer.safeFail(ChannelOutboundBuffer.java:723)
    at io.netty.channel.ChannelOutboundBuffer.remove0(ChannelOutboundBuffer.java:308)
    at io.netty.channel.ChannelOutboundBuffer.failFlushed(ChannelOutboundBuffer.java:660)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:735)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.handleWriteError(AbstractChannel.java:950)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:933)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:742)
    at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:728)
    at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:750)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:765)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
Caused by: java.io.FileNotFoundException: /tmp/blockmgr-98a2a29a-5231-4fed-a82e-6bc0531ad407/15/shuffle_1_356_0.index (No such file or directory)
    at java.base/java.io.RandomAccessFile.open0(Native Method)
    at java.base/java.io.RandomAccessFile.open(RandomAccessFile.java:345)
    at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:259)
    at java.base/java.io.RandomAccessFile.<init>(RandomAccessFile.java:214)
    at io.netty.channel.DefaultFileRegion.open(DefaultFileRegion.java:88)
    at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:128)
    at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
    at org.apache.spark.network.crypto.TransportCipher$EncryptedMessage.encryptMore(TransportCipher.java:347)
    at org.apache.spark.network.crypto.TransportCipher$EncryptedMessage.transferTo(TransportCipher.java:310)
    at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:362)
    at io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:238)
    at io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:212)
    at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:400)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:931)
    ... 17 more
22/08/10 18:05:34 WARN BlockManagerDecommissioner: Stop migrating shuffle blocks to BlockManagerId(0, 10.240.2.65, 43481, None)
This wrapped exception should be handled explicitly, which would also avoid unnecessarily retrying this shuffle block and stopping the current migration thread.
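As a rough illustration of the idea, and not the actual Spark patch, the sketch below walks the cause chain of a wrapper exception and checks whether a FileNotFoundException is buried inside it. The class name `WrappedCauseExample`, the helpers `findCause` and `isMissingFile`, and the depth limit are all hypothetical; a plain RuntimeException stands in for SparkException so the example runs without Spark on the classpath.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical helper, not part of Spark: inspects the cause chain of a wrapper exception.
final class WrappedCauseExample {

    // Assumed safety bound so a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    /** Returns the first throwable in the cause chain (including t itself) of the given type, or null. */
    public static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
        Throwable cur = t;
        for (int depth = 0; cur != null && depth < MAX_DEPTH; depth++) {
            if (type.isInstance(cur)) {
                return type.cast(cur);
            }
            cur = cur.getCause();
        }
        return null;
    }

    /** True when the failure was ultimately caused by a missing file. */
    public static boolean isMissingFile(Throwable t) {
        return findCause(t, FileNotFoundException.class) != null;
    }

    public static void main(String[] args) {
        // Mirrors the shape of the trace above: wrapper -> IOException -> FileNotFoundException.
        // RuntimeException is only a stand-in for SparkException.
        Throwable wrapped = new RuntimeException(
            "Exception thrown in awaitResult",
            new IOException(
                "Failed to send RPC",
                new FileNotFoundException("shuffle_1_356_0.index (No such file or directory)")));

        if (isMissingFile(wrapped)) {
            System.out.println("Block file is gone: skip this block, keep the migration thread running");
        } else {
            System.out.println("Other failure: retry or stop migrating to this peer");
        }
    }
}
```

A handler that only matches on the outermost exception type would miss this case, since the IOException is one level down. Checking the cause chain (or re-checking that the block files still exist) lets the caller tell a deleted shuffle block apart from a genuine transfer failure.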