[MESOS-9148] Make cgroups destroy timeout configurable for Mesos containerizer - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.7.0
Component/s: None
Labels: None
Target Version/s: 1.7.0

Description

Previously, all containers launched by the Mesos containerizer used the same 1-minute timeout for destroying their cgroups. However, we have observed that for certain containers (possibly with deep system calls), the cgroup hierarchy was not destroyed within that timeout. This is quite problematic because the containerizer then short-circuits the destroy routine and skips isolator::cleanup. We have observed GPU resources being leaked indefinitely as a result (see MESOS-8038).

The proposed workaround is to add an optional agent flag that allows operators to override this timeout.
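The failure mode and the workaround can be illustrated with a minimal sketch. This is not Mesos code: `destroy_container`, `destroy_cgroup`, `cleanup_isolators`, and `timeout_secs` are hypothetical names chosen for illustration, and the only value taken from the description above is the 1-minute default.

```python
import concurrent.futures
import time

# Default mirrors the fixed 1-minute timeout described in this issue.
DEFAULT_DESTROY_TIMEOUT_SECS = 60.0


def destroy_container(destroy_cgroup, cleanup_isolators,
                      timeout_secs=DEFAULT_DESTROY_TIMEOUT_SECS):
    """Run destroy_cgroup with a deadline, then clean up isolators.

    Returns True if cleanup ran, False if the destroy step timed out and
    cleanup was skipped (the short-circuit that leaks resources).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(destroy_cgroup)
    try:
        future.result(timeout=timeout_secs)
    except concurrent.futures.TimeoutError:
        # The destroy routine gives up here; cleanup is never reached.
        return False
    finally:
        # Do not block on the worker; it may still be running.
        executor.shutdown(wait=False)
    cleanup_isolators()
    return True


if __name__ == "__main__":
    def slow_destroy():
        # Hypothetical stand-in for a cgroup that is slow to destroy.
        time.sleep(0.2)

    # With a timeout shorter than the destroy time, cleanup is skipped.
    print(destroy_container(slow_destroy, lambda: None, timeout_secs=0.05))
    # With an operator-supplied longer timeout, cleanup runs.
    print(destroy_container(slow_destroy, lambda: None, timeout_secs=1.0))
```

Making the timeout a parameter with the old value as its default keeps existing behavior unchanged unless an operator opts in to a longer deadline.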

People

Assignee: Zhitao Li

Reporter: Zhitao Li

Shepherd: Gilbert Song

Votes: 0

Watchers: 1

Dates

Created: 10/Aug/18 17:47

Updated: 10/Aug/18 17:58

Resolved: 10/Aug/18 17:58