Previously, all containers launched by the Mesos containerizer used the same 1-minute timeout for destroying their cgroups. However, we have observed that for certain containers (possibly with deep system calls), the cgroup hierarchy was not destroyed within that timeout. This is quite problematic because the containerizer then short-circuits the destroy routine and skips isolator::cleanup. We have observed GPU resources being leaked indefinitely as a result (see MESOS-8038).
The proposed workaround is to add an optional agent flag that allows operators to override this timeout.