Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
1.10.0, 1.10.1, 1.11.0
Description
Partitions are released in the main thread of the TaskExecutor (see the stacktrace below). Because deleting files is blocking I/O, this can lead to missed heartbeats, RPC timeouts, etc. The partitions should instead be released in a dedicated I/O thread pool (TaskExecutor#ioExecutor is a candidate, but it would require a higher default thread count).
2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x00007f7fcc071000 nid=0x1f3f9 runnable [0x00007f7fd302c000]
2020-05-06T19:13:12.4383983Z java.lang.Thread.State: RUNNABLE
2020-05-06T19:13:12.4384519Z at sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method)
2020-05-06T19:13:12.4384971Z at sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146)
2020-05-06T19:13:12.4385465Z at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231)
2020-05-06T19:13:12.4386000Z at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
2020-05-06T19:13:12.4386458Z at java.nio.file.Files.delete(Files.java:1126)
2020-05-06T19:13:12.4386968Z at org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93)
2020-05-06T19:13:12.4388088Z at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247)
2020-05-06T19:13:12.4388765Z at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208)
2020-05-06T19:13:12.4389444Z - locked <0x00000000ff836d78> (a java.lang.Object)
2020-05-06T19:13:12.4389905Z at org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290)
2020-05-06T19:13:12.4390481Z at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80)
2020-05-06T19:13:12.4391118Z - locked <0x000000009d452b90> (a java.util.HashMap)
2020-05-06T19:13:12.4391597Z at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153)
2020-05-06T19:13:12.4392267Z at org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62)
2020-05-06T19:13:12.4392914Z at org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776)
2020-05-06T19:13:12.4393366Z at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
2020-05-06T19:13:12.4393813Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-05-06T19:13:12.4394257Z at java.lang.reflect.Method.invoke(Method.java:498)
2020-05-06T19:13:12.4394693Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
2020-05-06T19:13:12.4395202Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199)
2020-05-06T19:13:12.4395686Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
2020-05-06T19:13:12.4396165Z at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$72/775020844.apply(Unknown Source)
2020-05-06T19:13:12.4396606Z at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
2020-05-06T19:13:12.4397015Z at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
2020-05-06T19:13:12.4397447Z at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
2020-05-06T19:13:12.4397874Z at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
2020-05-06T19:13:12.4398414Z at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
2020-05-06T19:13:12.4398879Z at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399321Z at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399737Z at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
2020-05-06T19:13:12.4400138Z at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
2020-05-06T19:13:12.4400552Z at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
2020-05-06T19:13:12.4400930Z at akka.actor.ActorCell.invoke(ActorCell.scala:561)
2020-05-06T19:13:12.4401390Z at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
2020-05-06T19:13:12.4401763Z at akka.dispatch.Mailbox.run(Mailbox.scala:225)
2020-05-06T19:13:12.4402135Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
2020-05-06T19:13:12.4402540Z at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2020-05-06T19:13:12.4402984Z at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
2020-05-06T19:13:12.4403448Z at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
2020-05-06T19:13:12.4404096Z at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
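The change proposed in the description could look roughly like the sketch below: the blocking file deletion is handed to an I/O executor, so the calling (RPC main) thread returns immediately. This is a minimal illustration and not the actual Flink patch. The class `AsyncPartitionRelease`, the method `releaseAsync`, and the `ioExecutor` parameter are hypothetical names introduced here; in Flink the executor would presumably be something like TaskExecutor#ioExecutor.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical helper for illustration only; not part of Flink. */
final class AsyncPartitionRelease {

    private AsyncPartitionRelease() {}

    /**
     * Schedules deletion of the given partition files on ioExecutor and
     * returns right away. The returned future completes once all files are
     * deleted, or completes exceptionally if a deletion fails.
     */
    static CompletableFuture<Void> releaseAsync(List<Path> partitionFiles, Executor ioExecutor) {
        return CompletableFuture.runAsync(() -> {
            for (Path file : partitionFiles) {
                try {
                    // Blocking I/O: runs on an ioExecutor thread, not on the caller's thread.
                    Files.deleteIfExists(file);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }, ioExecutor);
    }
}
```

A caller on the main thread would invoke `releaseAsync(files, ioExecutor)` and continue processing heartbeats and RPCs; any follow-up that depends on the deletion can be chained onto the returned future.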
Issue Links
- causes: FLINK-18035 Executors#newCachedThreadPool could not work as expected (Closed)
- is related to: FLINK-17194 TPC-DS end-to-end test fails due to missing execution attempt (Closed)
- is required by: FLINK-17621 Use default akka.ask.timeout in TPC-DS e2e test (Closed)
- links to