Details
Type: Story
Status: Open
Priority: Major
Resolution: Unresolved
Description
There needs to be a standard process for customizing the SLA used to validate that a task on a host can be killed in order to drain that host into maintenance. Right now the default is 95% over 30 minutes, but certain services (such as memcache) would fare much better under a different setting, for example 99% over 5 minutes.
We could build this tooling around the existing aurora_admin drain_hosts command, but a custom SLA passed there would apply to all tasks on that host rather than to individual services, which would increase complexity.
Lastly, if we decide to make this user-settable rather than operator-whitelistable, it is important that we still put firm limits on acceptable values, to prevent a service from setting 100% over 0 minutes and holding hosts hostage.
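As a rough illustration of the limits described above, here is a minimal sketch of a per-service drain SLA with bounds checking. It is not existing Aurora code: the names `DrainSla` and `validate_drain_sla` are hypothetical, and the allowed ranges are placeholder values that an operator would have to choose. Only the 95% over 30 minutes default comes from this ticket.

```python
from dataclasses import dataclass

# Default drain SLA stated in this ticket: 95% over 30 minutes.
DEFAULT_PERCENTAGE = 95.0
DEFAULT_DURATION_SECS = 30 * 60

# Hypothetical operator-defined limits (assumptions, not Aurora settings).
MIN_PERCENTAGE, MAX_PERCENTAGE = 50.0, 99.9
MIN_DURATION_SECS, MAX_DURATION_SECS = 60, 60 * 60


@dataclass(frozen=True)
class DrainSla:
    """SLA a service asks to be validated before its tasks are killed."""
    percentage: float = DEFAULT_PERCENTAGE
    duration_secs: int = DEFAULT_DURATION_SECS


def validate_drain_sla(sla):
    """Return a list of violations; an empty list means the SLA is acceptable."""
    errors = []
    if not MIN_PERCENTAGE <= sla.percentage <= MAX_PERCENTAGE:
        errors.append(
            "percentage %.1f outside [%.1f, %.1f]"
            % (sla.percentage, MIN_PERCENTAGE, MAX_PERCENTAGE)
        )
    if not MIN_DURATION_SECS <= sla.duration_secs <= MAX_DURATION_SECS:
        errors.append(
            "duration %ds outside [%ds, %ds]"
            % (sla.duration_secs, MIN_DURATION_SECS, MAX_DURATION_SECS)
        )
    return errors


if __name__ == "__main__":
    # The memcache-style request from the description: 99% over 5 minutes.
    print(validate_drain_sla(DrainSla(99.0, 5 * 60)))
    # The abusive request from the description: 100% over 0 minutes.
    print(validate_drain_sla(DrainSla(100.0, 0)))
```

With these placeholder limits, the 99% over 5 minutes request passes, while 100% over 0 minutes is rejected on both fields. The same check could be applied whether the value is set by a user or whitelisted by an operator.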