Uploaded image for project: 'Apache Nemo'
  1. Apache Nemo
  2. NEMO-299

WindowedWordCountITCase Hangs (Heisenbug)

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.4

    Description

      A very very tough bug!  

      WindowedWordCountITCase#testStreamingSchedulerAndPipeFixedWindow hangs intermittently on my mac. Unfortunately this does not happen on a different (more performant) laptop. I haven't also seen Travis failing due to this bug.

       

      A reliable ( ? ) way to reproduce this bug is to run the test in an infinite loop and wait till the job hangs, which in my case happens around every 1 in 4 runs.

      while :; do mvn -Dtest=WindowedWordCountITCase#testStreamingSchedulerAndPipeFixedWindow test; sleep 1; done
      
      

       

      The hanging happens at the very first stage, Stage0. Stage0 has 5 tasks and the child Stage1 has 2 tasks. It happens that one of the Stage1 tasks calls into Netty's channel.writeAndFlush() in ContextManager#newContext for all Stage0 tasks to establish connections to read data, but some of the Stage0 tasks fail to obtain the written metadata via ContextManager#channelRead0. I feel as if the metadata written by the Stage1 'disappears' in the network somewhere. As a result those Stage0 tasks hang in waiting for connection requests from the Stage1 task, ultimately hanging the job and the test.

      I even tried running the test with the wifi turned off but the issue still persists. The following is a log that I got from one of the test runs. Here, Stage1-1-0 contacts the five parent Stage0 tasks (newInputContext in the log), but we cannot see any logs of Stage0-0-0, Stage0-2-0, Stage0-4-0 getting the contact, whereas we see the logs of Stage0-1-0, Stage0-3-0 (channelRead0 in the log).

       

      grep -nr ‘runtimeEdgeId: “edge4” / dstTaskIndex: 0’ *
      Node-59-1542778856483/evaluator.stdout:56: INFO 11-21 14:41:08,546 ByteTransfer:67 [pool-2-thread-5] - newInputContext srcTaskIndex: 2 / runtimeEdgeId: “edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
      Node-59-1542778856483/evaluator.stdout:61: INFO 11-21 14:41:08,546 ByteTransfer:67 [pool-2-thread-3] - newInputContext srcTaskIndex: 0 / runtimeEdgeId: “edge4” / dstTaskIndex: 0 / numPipeToWait: 2 /
      Node-59-1542778856483/evaluator.stdout:72: INFO 11-21 14:41:08,547 ByteTransfer:67 [pool-2-thread-4] - newInputContext srcTaskIndex: 1 / runtimeEdgeId: “edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
      Node-59-1542778856483/evaluator.stdout:73: INFO 11-21 14:41:08,546 ByteTransfer:67 [pool-2-thread-6] - newInputContext srcTaskIndex: 3 / runtimeEdgeId: “edge4” / dstTaskIndex: 0 / numPipeToWait: 2 /
      Node-59-1542778856483/evaluator.stdout:105: INFO 11-21 14:41:08,557 ByteTransfer:67 [pool-2-thread-7] - newInputContext srcTaskIndex: 4 / runtimeEdgeId: “edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
      Node-59-1542778856483/evaluator.stdout:151: INFO 11-21 14:41:08,583 ContextManager:133 [byte:server:working-4-2] - 5 channelRead0 srcTaskIndex: 3 / runtimeEdgeId: “edge4” / dstTaskIndex: 0 / numPipeToWait: 2 /
      Node-59-1542778856483/evaluator.stdout:167: INFO 11-21 14:41:08,587 ContextManager:133 [byte:server:working-4-2] - 5 channelRead0 srcTaskIndex: 1 / runtimeEdgeId: “edge4" / dstTaskIndex: 0 / numPipeToWait: 2 /
      
      

       

      Attachments

        Issue Links

          Activity

            People

              wonook Wonook
              johnyangk John Yang
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: