[FLINK-2134] Deadlock in SuccessAfterNetworkBuffersFailureITCase - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.10.0
Fix Version/s: 0.9
Component/s: None
Labels: None

Description

I ran into the issue in a Travis run for a PR: https://s3.amazonaws.com/archive.travis-ci.org/jobs/64994288/log.txt

I can reproduce this locally by running SuccessAfterNetworkBuffersFailureITCase multiple times:

cluster = new ForkableFlinkMiniCluster(config, false);
for (int i = 0; i < 100; i++) {
   // run test programs CC, KMeans, CC
}
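A repeated-run reproduction like the loop above is easier to work with if each run has a timeout, so a hang is reported instead of blocking the test forever. The following is a minimal sketch of that idea using only `java.util.concurrent`. The class `RepeatWithTimeout` and the method `firstHangingRun` are hypothetical names for illustration and are not part of Flink or of the test case.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Flink): runs a job repeatedly and reports the first run that hangs. */
class RepeatWithTimeout {

    /**
     * Runs the job up to 'runs' times on a background daemon thread.
     * Returns the zero-based index of the first run that exceeds the timeout,
     * or -1 if every run finished in time.
     */
    static int firstHangingRun(Callable<Void> job, int runs, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "repro-runner");
            t.setDaemon(true); // a hanging run must not keep the JVM alive
            return t;
        });
        try {
            for (int i = 0; i < runs; i++) {
                Future<Void> result = executor.submit(job);
                try {
                    result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    return i; // this run did not finish in time; treat it as a suspected deadlock
                }
            }
            return -1;
        } finally {
            executor.shutdownNow(); // interrupts a run that is still blocked
        }
    }
}
```

In a real reproduction the `job` would submit the test programs to the mini cluster. Taking a thread dump (for example with `jstack`) when a run times out would then produce a trace like the one in this report.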

The iteration tasks wait for superstep notifications like this:

"Join (Join at runConnectedComponents(SuccessAfterNetworkBuffersFailureITCase.java:128)) (8/6)" daemon prio=5 tid=0x00007f95f374f800 nid=0x138a7 in Object.wait() [0x0000000123f2a000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000007f89e3440> (a java.lang.Object)
	at org.apache.flink.runtime.iterative.concurrent.SuperstepKickoffLatch.awaitStartOfSuperstepOrTermination(SuperstepKickoffLatch.java:57)
	- locked <0x00000007f89e3440> (a java.lang.Object)
	at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:131)
	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
	at java.lang.Thread.run(Thread.java:745)
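The trace shows a task thread in a timed `Object.wait()` on a monitor it had locked, which is the usual shape of a wait/notify latch. The sketch below illustrates that pattern under stated assumptions: `KickoffLatchSketch` is a hypothetical stand-in, not Flink's actual `SuperstepKickoffLatch`, and the field names, the 2000 ms wait interval, and the `triggerNextSuperstep`/`signalTermination` methods are invented for illustration. It shows why a waiter stays in `TIMED_WAITING` indefinitely if neither the next-superstep nor the termination signal ever arrives.

```java
/** Hypothetical stand-in for a superstep latch; not Flink's implementation. */
class KickoffLatchSketch {

    private final Object monitor = new Object();
    private int superstep = 1;          // current superstep number (assumed to start at 1)
    private boolean terminated = false; // set once the iteration is finished

    /** Advances to the next superstep and wakes all waiting tasks. */
    void triggerNextSuperstep() {
        synchronized (monitor) {
            superstep++;
            monitor.notifyAll();
        }
    }

    /** Marks the iteration as terminated and wakes all waiting tasks. */
    void signalTermination() {
        synchronized (monitor) {
            terminated = true;
            monitor.notifyAll();
        }
    }

    /**
     * Blocks until the expected superstep has started or termination was signalled.
     * Returns true if the iteration terminated, false if the superstep started.
     */
    boolean awaitStartOfSuperstepOrTermination(int expectedSuperstep) throws InterruptedException {
        synchronized (monitor) {
            // Re-check the condition after every wake-up. The timed wait is what a
            // thread dump reports as TIMED_WAITING (on object monitor).
            while (!terminated && superstep < expectedSuperstep) {
                monitor.wait(2000);
            }
            return terminated;
        }
    }
}
```

With a latch of this shape, a missed or never-sent signal does not raise an error. The waiting task just keeps looping, which matches a job that appears deadlocked while its threads sit in `Object.wait()`.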

I asked rmetzger to reproduce this, and it deadlocks for him as well. The deadlock only shows up after multiple runs, and only when the system is under some load.

People

Assignee: Unassigned

Reporter: Ufuk Celebi

Votes: 0

Watchers: 3

Dates

Created: 02/Jun/15 12:39

Updated: 06/Oct/15 15:51

Resolved: 04/Jun/15 09:18