Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.16
    • REEF.NET
    • None

    Description

      O.A.R.Tests.Functional.IMRU.IMRUCloseTaskTest.TestTaskCloseOnLocalRuntime fails frequently in AppVeyor test runs. The error is typically "Cannot read from log file" with rest runtime 65-70 seconds, and according to our current thinking in REEF-1417 this is caused by Evaluator(s) still running after 60 seconds of test execution.

      In one case the test completed in 38 seconds, with error message

      Assert.Equal() Failure
      Expected: 4
      Actual:   5
      at 
      Assert.Equal(numTasks, failedCount + completedCount);
      

      In successful run, the test takes 30 seconds.

      We need to investigate whether IMRU job itself takes longer than 60 seconds on AppVeyor machines, or whether Evaluator doesn't close properly.

      Attachments

        Activity

          Looks like this has been fixed with REEF-1466. Let's observe for some time whether it reoccurs.

          MariiaMykhailova Mariia Mykhailova added a comment - Looks like this has been fixed with REEF-1466 . Let's observe for some time whether it reoccurs.

          The issue reproduces quite often during the past months, so we can't count this as fixed. Let's observe now whether resolving REEF-1482 fixed this.

          MariiaMykhailova Mariia Mykhailova added a comment - The issue reproduces quite often during the past months, so we can't count this as fixed. Let's observe now whether resolving REEF-1482 fixed this.

          The issue reproduces with the fix for REEF-1482: https://ci.appveyor.com/project/ApacheSoftwareFoundation/reef/build/1146-master/job/8ss7f32vgnhgqi8n/tests
          So we don't get this resolved for free

          MariiaMykhailova Mariia Mykhailova added a comment - The issue reproduces with the fix for REEF-1482 : https://ci.appveyor.com/project/ApacheSoftwareFoundation/reef/build/1146-master/job/8ss7f32vgnhgqi8n/tests So we don't get this resolved for free

          Investigation so far: in problematic cases IMRU master task takes about 7 minutes to complete (after dispose command is sent to it). In good cases even if master is the last task to complete it just takes several seconds.

          MariiaMykhailova Mariia Mykhailova added a comment - Investigation so far: in problematic cases IMRU master task takes about 7 minutes to complete (after dispose command is sent to it). In good cases even if master is the last task to complete it just takes several seconds.

          More details: I've isolated logs from driver and IMRU master task for successful test run and failure.

          In both cases IMRU master task receives close event. But in successful case, it gets an exception quickly and proceeds to return with cancellation token = true.

          ERROR: Received exception in UpdateTaskHost with cancellation token True: [System.OperationCanceledException: GetData operation is canceled

          In failure case, task enters a loop trying to connect to one of the tasks which are already closed, and spends the extra 7 minutes doing 200 attempts to connect.

          Org.Apache.REEF.Wake.Remote.Impl.RemoteConnectionRetryHandler Information: 0 : 2016-12-02T20:53:06.7473517+00:00 0011
          INFO: Retry - Count:1, Delay:00:00:01, Exception:System.Net.Sockets.SocketException (0x80004005): No connection could be made because the target machine actively refused it 127.0.0.1:9458
          ...
          Org.Apache.REEF.Wake.Remote.Impl.RemoteConnectionRetryHandler Information: 0 : 2016-12-02T20:59:48.7139088+00:00 0011
          INFO: Retry - Count:200, Delay:00:00:01, Exception:System.Net.Sockets.SocketException (0x80004005): No connection could be made because the target machine actively refused it 127.0.0.1:9458
          ...
          ERROR: Received exception in UpdateTaskHost with cancellation token True: [Org.Apache.REEF.Wake.Remote.Impl.TcpClientConnectionException: Retried 200 times but connection to endpoint 127.0.0.1:9458 failed, RetriesDone=200 ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 127.0.0.1:9458
          

          It looks like our cancellation works differently when the master task is in different stages of an iteration. Need to look into this further, but so far it looks like a genuine problem with IMRU code, not just a poorly written test.

          MariiaMykhailova Mariia Mykhailova added a comment - More details: I've isolated logs from driver and IMRU master task for successful test run and failure . In both cases IMRU master task receives close event. But in successful case, it gets an exception quickly and proceeds to return with cancellation token = true. ERROR: Received exception in UpdateTaskHost with cancellation token True: [System.OperationCanceledException: GetData operation is canceled In failure case, task enters a loop trying to connect to one of the tasks which are already closed, and spends the extra 7 minutes doing 200 attempts to connect. Org.Apache.REEF.Wake.Remote.Impl.RemoteConnectionRetryHandler Information: 0 : 2016-12-02T20:53:06.7473517+00:00 0011 INFO: Retry - Count:1, Delay:00:00:01, Exception:System.Net.Sockets.SocketException (0x80004005): No connection could be made because the target machine actively refused it 127.0.0.1:9458 ... Org.Apache.REEF.Wake.Remote.Impl.RemoteConnectionRetryHandler Information: 0 : 2016-12-02T20:59:48.7139088+00:00 0011 INFO: Retry - Count:200, Delay:00:00:01, Exception:System.Net.Sockets.SocketException (0x80004005): No connection could be made because the target machine actively refused it 127.0.0.1:9458 ... ERROR: Received exception in UpdateTaskHost with cancellation token True: [Org.Apache.REEF.Wake.Remote.Impl.TcpClientConnectionException: Retried 200 times but connection to endpoint 127.0.0.1:9458 failed, RetriesDone=200 ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 127.0.0.1:9458 It looks like our cancellation works differently when the master task is in different stages of an iteration. Need to look into this further, but so far it looks like a genuine problem with IMRU code, not just a poorly written test.

          I've filed REEF-1684 for the proper fix. As a mitigation, I'll configure MaxConnectionRetry and SleepTime to smaller values for this test, so that even this test hits scenario when task cancellation goes through all the retries, it will be done sooner.

          MariiaMykhailova Mariia Mykhailova added a comment - I've filed REEF-1684 for the proper fix. As a mitigation, I'll configure MaxConnectionRetry and SleepTime to smaller values for this test, so that even this test hits scenario when task cancellation goes through all the retries, it will be done sooner.
          MariiaMykhailova Mariia Mykhailova added a comment - PR https://github.com/apache/reef/pull/1199/files
          juliaw Julia Wang added a comment - Resolved via https://github.com/apache/reef/pull/1199/files

          People

            MariiaMykhailova Mariia Mykhailova
            MariiaMykhailova Mariia Mykhailova
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: