Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.10.0
- None
- None
Description
I ran into the issue in a Travis run for a PR: https://s3.amazonaws.com/archive.travis-ci.org/jobs/64994288/log.txt
I can reproduce this locally by running SuccessAfterNetworkBuffersFailureITCase multiple times:
    cluster = new ForkableFlinkMiniCluster(config, false);
    for (int i = 0; i < 100; i++) {
        // run test programs CC, KMeans, CC
    }
The iteration tasks wait for superstep notifications like this:
"Join (Join at runConnectedComponents(SuccessAfterNetworkBuffersFailureITCase.java:128)) (8/6)" daemon prio=5 tid=0x00007f95f374f800 nid=0x138a7 in Object.wait() [0x0000000123f2a000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00000007f89e3440> (a java.lang.Object) at org.apache.flink.runtime.iterative.concurrent.SuperstepKickoffLatch.awaitStartOfSuperstepOrTermination(SuperstepKickoffLatch.java:57) - locked <0x00000007f89e3440> (a java.lang.Object) at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:131) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) at java.lang.Thread.run(Thread.java:745)
I've asked rmetzger to reproduce this and it deadlocks for him as well. The system needs to be under some load for this to occur after multiple runs.
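To illustrate the kind of wait shown in the thread dump above, here is a minimal sketch of a superstep latch built on a plain monitor object with a timed wait. It is not Flink's SuperstepKickoffLatch source: the class name KickoffLatchSketch, the 2000 ms timeout, and the error handling are assumptions made for illustration. What it does show is why a task stays parked in TIMED_WAITING inside the await method for as long as neither the next superstep nor termination is signalled.

```java
// Hypothetical sketch, not Flink's actual implementation.
class KickoffLatchSketch {

    // Plain monitor object, matching the "(a java.lang.Object)" lock in the dump.
    private final Object monitor = new Object();

    private int superstepNumber = 1;   // supersteps are counted from 1 in this sketch
    private boolean terminated;

    /** Called by the coordinating side to start the next superstep. */
    public void triggerNextSuperstep() {
        synchronized (monitor) {
            if (terminated) {
                throw new IllegalStateException("Already terminated.");
            }
            superstepNumber++;
            monitor.notifyAll();
        }
    }

    /** Called by the coordinating side to end the iteration. */
    public void signalTermination() {
        synchronized (monitor) {
            terminated = true;
            monitor.notifyAll();
        }
    }

    /**
     * Blocks until the given superstep has started or termination was signalled.
     * Returns true on termination, false if the superstep started.
     */
    public boolean awaitStartOfSuperstepOrTermination(int superstep) throws InterruptedException {
        synchronized (monitor) {
            while (true) {
                if (terminated) {
                    return true;
                }
                if (superstepNumber == superstep) {
                    return false;
                }
                if (superstepNumber > superstep) {
                    throw new IllegalStateException("Superstep " + superstep + " has already passed.");
                }
                // Timed wait (assumed timeout): the thread shows up as TIMED_WAITING
                // and re-checks the condition after every wake-up or timeout.
                monitor.wait(2000);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        KickoffLatchSketch latch = new KickoffLatchSketch();
        Thread waiter = new Thread(() -> {
            try {
                boolean done = latch.awaitStartOfSuperstepOrTermination(2);
                System.out.println("woke up, terminated=" + done);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        waiter.start();
        latch.triggerNextSuperstep();  // without this call the waiter would loop forever
        waiter.join();
    }
}
```

In a latch of this shape, a waiter whose notification never arrives keeps cycling through the timed wait, which is consistent with a task hanging in the frame shown in the dump. Whether that is the actual cause here is not established by this sketch.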