Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
A very very tough bug!
WindowedWordCountITCase#testStreamingSchedulerAndPipeFixedWindow hangs intermittently on my Mac. Unfortunately, the hang does not reproduce on a different (more performant) laptop, and I also have not seen Travis fail because of this bug.
A (somewhat) reliable way to reproduce this bug is to run the test in an infinite loop and wait until the job hangs, which in my case happens in roughly 1 out of every 4 runs.
while :; do mvn -Dtest=WindowedWordCountITCase#testStreamingSchedulerAndPipeFixedWindow test; sleep 1; done
The hang happens at the very first stage, Stage0, which has 5 tasks; its child Stage1 has 2 tasks. One of the Stage1 tasks calls Netty's channel.writeAndFlush() in ContextManager#newContext for every Stage0 task, to establish the connections it needs to read data, but some of the Stage0 tasks never receive the written metadata via ContextManager#channelRead0. It feels as if the metadata written by the Stage1 task 'disappears' somewhere in the network. As a result, those Stage0 tasks keep waiting for a connection request from the Stage1 task, ultimately hanging the job and the test.
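For reference, below is a minimal, self-contained sketch of the write/read handshake pattern described above. This is not Nemo's actual code; the class and message contents are made up, and Netty's EmbeddedChannel is used so it runs without a real network. It also illustrates one diagnostic idea: attaching a ChannelFutureListener to the writeAndFlush() future on the writing side, so a failed write surfaces an error instead of the metadata silently disappearing.

import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.embedded.EmbeddedChannel;

public class HandshakeSketch {
  // Stands in for the reading side (ContextManager on a Stage0 task):
  // channelRead0 should fire once for each metadata message written by Stage1.
  static final class MetadataHandler extends SimpleChannelInboundHandler<String> {
    @Override
    protected void channelRead0(final ChannelHandlerContext ctx, final String metadata) {
      System.out.println("channelRead0 received: " + metadata);
    }
  }

  public static void main(final String[] args) {
    final EmbeddedChannel channel = new EmbeddedChannel(new MetadataHandler());

    // Stands in for the writing side (newContext on the Stage1 task): write the
    // metadata and watch the returned future, so a failed or dropped write is
    // reported rather than lost silently.
    final ChannelFuture future =
        channel.writeAndFlush("runtimeEdgeId: edge4 / dstTaskIndex: 0");
    future.addListener((ChannelFutureListener) f -> {
      if (!f.isSuccess()) {
        System.err.println("metadata write failed: " + f.cause());
      }
    });

    // EmbeddedChannel keeps outbound writes in a local buffer; feed the same
    // message back in as inbound data to exercise channelRead0 in-process.
    channel.writeInbound("runtimeEdgeId: edge4 / dstTaskIndex: 0");
    channel.finish();
  }
}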
I even tried running the test with Wi-Fi turned off, but the issue still persists. The following is a log from one of the test runs. Here, Stage1-1-0 contacts the five parent Stage0 tasks (newInputContext in the log), but there are no logs of Stage0-0-0, Stage0-2-0, or Stage0-4-0 receiving the contact, whereas Stage0-1-0 and Stage0-3-0 do log it (channelRead0 in the log).
grep -nr 'runtimeEdgeId: "edge4" / dstTaskIndex: 0' *
Node-59-1542778856483/evaluator.stdout:56: INFO 11-21 14:41:08,546 ByteTransfer:67 [pool-2-thread-5] - newInputContext srcTaskIndex: 2 / runtimeEdgeId: "edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
Node-59-1542778856483/evaluator.stdout:61: INFO 11-21 14:41:08,546 ByteTransfer:67 [pool-2-thread-3] - newInputContext srcTaskIndex: 0 / runtimeEdgeId: "edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
Node-59-1542778856483/evaluator.stdout:72: INFO 11-21 14:41:08,547 ByteTransfer:67 [pool-2-thread-4] - newInputContext srcTaskIndex: 1 / runtimeEdgeId: "edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
Node-59-1542778856483/evaluator.stdout:73: INFO 11-21 14:41:08,546 ByteTransfer:67 [pool-2-thread-6] - newInputContext srcTaskIndex: 3 / runtimeEdgeId: "edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
Node-59-1542778856483/evaluator.stdout:105: INFO 11-21 14:41:08,557 ByteTransfer:67 [pool-2-thread-7] - newInputContext srcTaskIndex: 4 / runtimeEdgeId: "edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
Node-59-1542778856483/evaluator.stdout:151: INFO 11-21 14:41:08,583 ContextManager:133 [byte:server:working-4-2] - 5 channelRead0 srcTaskIndex: 3 / runtimeEdgeId: "edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
Node-59-1542778856483/evaluator.stdout:167: INFO 11-21 14:41:08,587 ContextManager:133 [byte:server:working-4-2] - 5 channelRead0 srcTaskIndex: 1 / runtimeEdgeId: "edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /