Aurora / AURORA-1791

Commit ca683 is not backwards compatible.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • None
    • 0.17.0
    • None
    • None

    Description

      The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards compatible. The last section of the commit

      4. Modified the Health Checker and redefined the meaning initial_interval_secs.

      has serious, unintended consequences.

      Consider the following health check config:

            initial_interval_secs: 10
            interval_secs: 5
            max_consecutive_failures: 1
      

      On the 0.16.0 executor, no health checking occurs for the first 10 seconds, so the earliest a task can fail a health check is at the 10th second.

      On master, health checking starts right away, which means the task can fail within the first second because max_consecutive_failures is set to 1.

      This is not backwards compatible and needs to be fixed.
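
      The difference can be illustrated with a minimal sketch of when checks would run under each interpretation. This is not the executor's code: the function name, the `delay_first_check` flag, and the idealized "first check at second 0, then every interval_secs" schedule are assumptions made only for illustration.

```python
def check_times(initial_interval_secs, interval_secs, horizon_secs, delay_first_check):
    """Return the times (seconds after task start) at which health checks run.

    delay_first_check=True  models the 0.16.0 behavior described above:
                            nothing runs until initial_interval_secs has elapsed.
    delay_first_check=False models the behavior described for master:
                            checks start immediately (idealized as second 0).
    """
    first_check = initial_interval_secs if delay_first_check else 0
    return list(range(first_check, horizon_secs + 1, interval_secs))


# Config from the description: initial_interval_secs=10, interval_secs=5.
old_schedule = check_times(10, 5, 20, delay_first_check=True)
new_schedule = check_times(10, 5, 20, delay_first_check=False)

print("0.16.0:", old_schedule)
print("master:", new_schedule)
```

      Under the old schedule no check, and therefore no failure, can happen before second 10. Under the new schedule a failing check can be recorded immediately, which is the regression reported here.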

      I think a good solution would be to restore the original meaning of initial_interval_secs and have the task transition into RUNNING once min_consecutive_successes is met.
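
      A rough sketch of that proposal, under stated assumptions, might look like the following. The class and method names are hypothetical. The consecutive-success threshold is called `min_consecutive_successes` here, following the "minimum consecutive successes" wording in the executor log. The sketch treats `max_consecutive_failures` as the number of consecutive failures that fails the task, as the example in this description does; the real executor may compare differently.

```python
class HealthTracker:
    """Hypothetical model of the proposed behavior, not the Aurora implementation."""

    def __init__(self, initial_interval_secs, max_consecutive_failures,
                 min_consecutive_successes):
        self.initial_interval_secs = initial_interval_secs
        self.max_consecutive_failures = max_consecutive_failures
        self.min_consecutive_successes = min_consecutive_successes
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.running = False
        self.failed = False

    def record(self, elapsed_secs, healthy):
        """Record one health check result observed elapsed_secs after task start."""
        # Original meaning of initial_interval_secs: nothing counts
        # during the initial interval (equivalent to not checking at all).
        if elapsed_secs < self.initial_interval_secs:
            return
        if healthy:
            self.consecutive_failures = 0
            self.consecutive_successes += 1
            # Transition into RUNNING only after enough consecutive successes.
            if self.consecutive_successes >= self.min_consecutive_successes:
                self.running = True
        else:
            self.consecutive_successes = 0
            self.consecutive_failures += 1
            # Assumption: reaching the threshold fails the task.
            if self.consecutive_failures >= self.max_consecutive_failures:
                self.failed = True


tracker = HealthTracker(initial_interval_secs=10, max_consecutive_failures=1,
                        min_consecutive_successes=2)
tracker.record(1, healthy=False)   # inside the initial interval: ignored
tracker.record(10, healthy=True)
tracker.record(15, healthy=True)   # second consecutive success
```

      With this model, a failure at second 1 no longer fails the task, and the task is only considered RUNNING after two consecutive successful checks following the initial interval.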

      An investigation shows initial_interval_secs was set to 5 but the task failed health checks right away:

      D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
      D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
      D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
      W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
      

            People

              kaih Kai Huang
              zmanji Zameer Manji