Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
1.12.0
-
None
Description
Master cron job is currently failing because of a deadlock introduced in FLINK-19026.
2020-09-23T21:50:39.2444176Z Found one Java-level deadlock: 2020-09-23T21:50:39.2444633Z ============================= 2020-09-23T21:50:39.2445001Z "Temp writer": 2020-09-23T21:50:39.2445484Z waiting to lock monitor 0x00007f4e14004ca8 (object 0x0000000086501948, a java.lang.Object), 2020-09-23T21:50:39.2446418Z which is held by "flink-akka.actor.default-dispatcher-2" 2020-09-23T21:50:39.2447193Z "flink-akka.actor.default-dispatcher-2": 2020-09-23T21:50:39.2447903Z waiting to lock monitor 0x00007f4e14004bf8 (object 0x0000000086501930, a org.apache.flink.runtime.io.network.partition.PrioritizedDeque), 2020-09-23T21:50:39.2448703Z which is held by "Temp writer" 2020-09-23T21:50:39.2448965Z 2020-09-23T21:50:39.2449384Z Java stack information for the threads listed above: 2020-09-23T21:50:39.2449900Z =================================================== 2020-09-23T21:50:39.2450325Z "Temp writer": 2020-09-23T21:50:39.2451050Z at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.checkAndWaitForSubpartitionView(LocalInputChannel.java:244) 2020-09-23T21:50:39.2452264Z - waiting to lock <0x0000000086501948> (a java.lang.Object) 2020-09-23T21:50:39.2453183Z at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer(LocalInputChannel.java:205) 2020-09-23T21:50:39.2454173Z at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:642) 2020-09-23T21:50:39.2455422Z - locked <0x0000000086501930> (a org.apache.flink.runtime.io.network.partition.PrioritizedDeque) 2020-09-23T21:50:39.2456310Z at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:619) 2020-09-23T21:50:39.2457311Z at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNext(SingleInputGate.java:602) 2020-09-23T21:50:39.2458205Z at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.getNext(InputGateWithMetrics.java:105) 2020-09-23T21:50:39.2459258Z at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:100) 2020-09-23T21:50:39.2460465Z at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47) 2020-09-23T21:50:39.2461344Z at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59) 2020-09-23T21:50:39.2462164Z at org.apache.flink.runtime.operators.TempBarrier$TempWritingThread.run(TempBarrier.java:178) 2020-09-23T21:50:39.2463418Z "flink-akka.actor.default-dispatcher-2": 2020-09-23T21:50:39.2464109Z at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.queueChannel(SingleInputGate.java:825) 2020-09-23T21:50:39.2465336Z - waiting to lock <0x0000000086501930> (a org.apache.flink.runtime.io.network.partition.PrioritizedDeque) 2020-09-23T21:50:39.2466228Z at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.notifyChannelNonEmpty(SingleInputGate.java:791) 2020-09-23T21:50:39.2467222Z at org.apache.flink.runtime.io.network.partition.consumer.InputChannel.notifyChannelNonEmpty(InputChannel.java:154) 2020-09-23T21:50:39.2468212Z at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.notifyDataAvailable(LocalInputChannel.java:236) 2020-09-23T21:50:39.2469577Z at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:76) 2020-09-23T21:50:39.2470607Z at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:133) 2020-09-23T21:50:39.2471765Z - locked <0x0000000086501948> (a java.lang.Object) 2020-09-23T21:50:39.2472685Z at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.updateInputChannel(SingleInputGate.java:489) 2020-09-23T21:50:39.2473727Z - locked <0x0000000086532500> (a java.lang.Object) 2020-09-23T21:50:39.2474449Z at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.updatePartitionInfo(NettyShuffleEnvironment.java:279) 2020-09-23T21:50:39.2475394Z at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$updatePartitions$12(TaskExecutor.java:758) 2020-09-23T21:50:39.2476235Z at org.apache.flink.runtime.taskexecutor.TaskExecutor$$Lambda$406/1860601696.run(Unknown Source) 2020-09-23T21:50:39.2476973Z at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) 2020-09-23T21:50:39.2477714Z at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) 2020-09-23T21:50:39.2478698Z at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) 2020-09-23T21:50:39.2479506Z at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 2020-09-23T21:50:39.2480263Z at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 2020-09-23T21:50:39.2481018Z at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 2020-09-23T21:50:39.2481727Z at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2020-09-23T21:50:39.2482192Z
The deadlock was introduced by the alternative fix for FLINK-12510 in FLINK-19026 .
The alternative fix avoided double lock acquisition in SingleInputGate#waitAndGetNextData by moving notifyDataAvailable from ResultSubpartition#createReadView to ResultPartitionManager#createSubpartitionView, which solved the circular deadlock of FLINK-12510 by not acquiring the buffers lock of PipelinedSubpartition.
However, that fix didn't go far enough. For local channels, it is still possible to create a similar deadlock. While SingleInputGate reads from LocalInputChannel, it holds the lock inputChannelsWithData. The channel may start requesting the partition and tries to acquire requestLock.
At the same time there is an update of partition info, which acquires the requestLock and notifies the SingleInputGate, which needs to lock on inputChannelsWithData.
The solution is to first acquire the partition and release requestLock before notifying SingleInputGate.
Attachments
Issue Links
- contains
-
FLINK-19387 ConnectedComponentsWithDeferredUpdateITCase.testJobWithoutObjectReuse deadlock
- Resolved
- is caused by
-
FLINK-19026 Improve threading model of Unaligner
- Resolved
- links to