Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.12
    • None

    Description

      I've created a Sub-task, as it's not clear this is the ONLY timed-out job we are looking for. This happens while running Fail_AllocatedEvaluator. It's not reproducible on every run, but something about my current setup/machine produces it more often than not when building from scratch.

      I've isolated a deadlock, and I will attach the stack trace from driver.stdout. [1]

      Basically: RuntimeClock.run, which holds a lock on schedule, triggers an idle check, but the idle check can't progress because the lock for DriverStatusManager is already held. That lock is held because an error was triggered: the error-handling path wants to stop RuntimeClock, but it is waiting to get the lock on schedule. Each side holds the lock the other needs, so neither can proceed.
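
To make the cycle concrete, here is a minimal standalone sketch of the same lock-order inversion. It is not REEF code: the class name, the two ReentrantLock fields, and the thread names are hypothetical stand-ins for the schedule and DriverStatusManager monitors. It uses tryLock with a timeout instead of synchronized so that the demo terminates rather than hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderSketch {
    // Hypothetical stand-ins for the two locks named in the report.
    static final ReentrantLock schedule = new ReentrantLock();
    static final ReentrantLock driverStatus = new ReentrantLock();

    /** Returns how many of the two threads could not get their second lock. */
    static int blockedThreads() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // "clock" takes schedule then wants driverStatus (the idle check).
        Thread clock = new Thread(
            () -> attempt(schedule, driverStatus, bothHoldFirst, bothTried, blocked), "clock");
        // "error" takes driverStatus then wants schedule (stopping the clock).
        Thread error = new Thread(
            () -> attempt(driverStatus, schedule, bothHoldFirst, bothTried, blocked), "error");
        clock.start();
        error.start();
        try {
            clock.join();
            error.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked.get();
    }

    static void attempt(ReentrantLock first, ReentrantLock second,
                        CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                        AtomicInteger blocked) {
        first.lock();
        try {
            // Wait until both threads hold their first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // With synchronized this would block forever; here it times out.
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            // Keep holding the first lock until the other thread has tried too.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads blocked on second lock: " + blockedThreads());
    }
}
```

Both threads fail to acquire their second lock, which is the situation the stack trace shows. The usual remedies are to acquire the two locks in one agreed order everywhere, or to avoid calling out to the other component while holding a lock; which of these fits REEF is left open here.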

      The lock used by the idle check originates from https://github.com/Microsoft-CISL/REEF/pull/1022/. We'll have to figure out what needs to be done.

      [1] For those interested: I set the LocalTestEnvironment timeout to a high value; then, when I noticed the job stalling, I ran

      kill -3 <PID>

      which sends SIGQUIT to the JVM and makes it print a thread dump (the stack trace) to driver.stdout.
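
If sending a signal is inconvenient, the same information can be collected from inside the JVM. The following is a rough sketch using the standard java.lang.management API. The class name DeadlockProbe and the idea of calling it from a test watchdog are assumptions, not something REEF provides.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockProbe {
    /** Describes any threads the JVM considers deadlocked, or says there are none. */
    static String report() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Returns null when no deadlock cycle is found.
        long[] ids = bean.findDeadlockedThreads();
        if (ids == null) {
            return "no deadlock detected";
        }
        StringBuilder sb = new StringBuilder();
        // Include locked monitors and ownable synchronizers in each entry.
        for (ThreadInfo info : bean.getThreadInfo(ids, true, true)) {
            sb.append(info);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Note that ThreadInfo.toString() truncates long stack traces, so the kill -3 dump remains the more complete record.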

      Attachments

        1. stacktrace.txt
          13 kB
          Brian Cho

          People

            Assignee: Unassigned
            Reporter: Brian Cho (chobrian)