Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
None
None
Description
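Concurrently releasing a SingleInputGate (here, from the task canceler) and re-triggering a partition request from one of its LocalInputChannels can lead to a deadlock. The two code paths acquire the gate's lock and the channel's lock in opposite order, as the thread dump below shows: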
Found one Java-level deadlock:
=============================
"Canceler for Map -> Sink: Unnamed (1/4)":
  waiting to lock monitor 0x0000000001e27bd8 (object 0x00000000ffa1f688, a java.lang.Object),
  which is held by "Timer-3"
"Timer-3":
  waiting to lock monitor 0x00007fdbd029ec48 (object 0x00000000ffa1f3a0, a java.lang.Object),
  which is held by "Canceler for Map -> Sink: Unnamed (1/4)"

Java stack information for the threads listed above:
===================================================
"Canceler for Map -> Sink: Unnamed (1/4)":
    at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.releaseAllResources(LocalInputChannel.java:240)
    - waiting to lock <0x00000000ffa1f688> (a java.lang.Object)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.releaseAllResources(SingleInputGate.java:348)
    - locked <0x00000000ffa1f3a0> (a java.lang.Object)
    at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1280)
    at java.lang.Thread.run(Thread.java:745)
"Timer-3":
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:307)
    - waiting to lock <0x00000000ffa1f3a0> (a java.lang.Object)
    at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:128)
    - locked <0x00000000ffa1f688> (a java.lang.Object)
    at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:148)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
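To make the lock cycle in the dump concrete, here is a small hypothetical model of the two code paths. Assumptions: `gateLock` and `channelLock` stand in for the monitors held by SingleInputGate and LocalInputChannel; they are modeled as `ReentrantLock`s acquired with a 2-second `tryLock` instead of `synchronized` blocks, so the demo reports the deadlock instead of hanging; `LockOrderDemo`, `run` and the helper methods are invented names, not Flink API. The "release the channel lock first" variant illustrates one generic way to break such a cycle and is not necessarily the change that fixed this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the lock-order inversion in the thread dump; not Flink code.
public class LockOrderDemo {
    private static final Lock gateLock = new ReentrantLock();    // stands in for the SingleInputGate lock
    private static final Lock channelLock = new ReentrantLock(); // stands in for the LocalInputChannel lock

    // Returns true if both paths completed. The latch forces the worst-case interleaving:
    // each thread holds its first lock before either one asks for its second lock.
    public static boolean run(boolean releaseChannelLockFirst) {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        boolean[] completed = new boolean[2];

        // Canceler path (releaseAllResources): gate lock first, then channel lock.
        Thread canceler = new Thread(() -> {
            gateLock.lock();
            try {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                completed[0] = tryBriefly(channelLock);
            } finally {
                gateLock.unlock();
            }
        }, "canceler");

        // Timer path (requestSubpartition -> retriggerPartitionRequest): channel lock, then gate lock.
        Thread timer = new Thread(() -> {
            channelLock.lock();
            boolean holdingChannel = true;
            try {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                if (releaseChannelLockFirst) {
                    // Candidate fix: drop the channel lock before calling into the gate,
                    // so this thread never holds both locks in the inverted order.
                    channelLock.unlock();
                    holdingChannel = false;
                }
                completed[1] = tryBriefly(gateLock);
            } finally {
                if (holdingChannel) {
                    channelLock.unlock();
                }
            }
        }, "timer");

        canceler.start();
        timer.start();
        joinQuietly(canceler);
        joinQuietly(timer); // join() makes the writes to completed[] visible here
        return completed[0] && completed[1];
    }

    // Stand-in for a blocking acquire: a real deadlock waits forever, here we give up after 2 s.
    private static boolean tryBriefly(Lock lock) {
        try {
            if (lock.tryLock(2, TimeUnit.SECONDS)) {
                lock.unlock();
                return true;
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("inverted lock order, both completed: " + run(false));
        System.out.println("channel lock released first, both completed: " + run(true));
    }
}
```

With the inverted ordering and the forced interleaving, the two threads can never both finish: each one's second lock is held by the other until that other gives up. So `run(false)` returns false. Dropping the channel lock before touching the gate removes the cycle, so `run(true)` should return true under normal scheduling. Acquiring both locks in a single global order (gate before channel) on every path would be the other usual remedy.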