[MESOS-9502] IOswitchboard cleanup could get stuck due to FD leak from a race. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.7.0
Fix Version/s: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
Component/s: containerization
Labels:
- containerizer

Epic Link:
Container Attach/Exec Improvements
Sprint:
Containerization R9 Sprint 37
Story Points:
8

Description

Our check container got stuck during destroy, which in turn blocked the parent container. The destroy is blocked by the I/O switchboard cleanup:

1223 18:04:41.000000 16269 switchboard.cpp:814] Sending SIGTERM to I/O switchboard server (pid: 62854) since container 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e is being destroyed
....
1227 04:45:38.000000 5189 switchboard.cpp:916] I/O switchboard server process for container 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e has terminated (status=N/A)

Note the timestamps: the I/O switchboard server terminated more than three days after the SIGTERM was sent.

Root Cause:
Fundamentally, this is caused by a race between the .discard() triggered by the check container TIMEOUT and IOSB extracting the ContainerIO object. This race can be exposed by an overloaded or slow agent process. The race is triggered as follows:

1. Right after the IOSB server process starts running, the check container times out and the checker process returns a failure, which closes the HTTP connection to the agent.
2. On the agent side, when the connection breaks, the handler triggers a discard on the returned future, which causes the future of containerizer->launch() to transition to the DISCARDED state.
3. In the containerizer, the DISCARDED state is propagated back to IOSB prepare(), which stops its continuation that extracts the ContainerIO (the extraction implies that the object is cleaned up and its FDs, one end of the pipes created in IOSB, are closed in its destructor).
4. The agent starts to destroy the container because its launch result was discarded, and asks IOSB to clean up the container.
5. The IOSB server is still running, so the agent sends it a SIGTERM.
6. The SIGTERM handler unblocks the IOSB server's redirection, so that it redirects stdout/stderr from the container to the logger before exiting.
7. io::redirect() calls io::splice(), which reads from the other end of those pipes forever, since the leaked FDs are never closed.

This issue is hard to reproduce except on a busy agent, because the timeout has to happen exactly AFTER the IOSB server starts running and BEFORE IOSB extracts the ContainerIO.
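
The reason a leaked FD makes the reader wait forever can be shown with plain POSIX pipes. The following is a minimal sketch, not Mesos code: the names `ReadResult`, `readOnce` and `readAfterWriterCloses` are hypothetical, and a non-blocking read is used only so that the demonstration returns instead of hanging. It assumes the usual pipe semantics, where a read reports EOF only after every descriptor referring to the write end has been closed.

```cpp
#include <cassert>
#include <cerrno>

#include <fcntl.h>
#include <unistd.h>

// Outcome of a single non-blocking read on the read end of a pipe.
enum class ReadResult { Eof, WouldBlock, Data, Error };

// Performs one read on `fd` and classifies the result.
ReadResult readOnce(int fd)
{
  char buffer[64];
  ssize_t n = ::read(fd, buffer, sizeof(buffer));

  if (n == 0) return ReadResult::Eof;
  if (n > 0) return ReadResult::Data;
  if (errno == EAGAIN || errno == EWOULDBLOCK) return ReadResult::WouldBlock;
  return ReadResult::Error;
}

// Creates a pipe, optionally keeps a duplicate of the write end open
// (standing in for the leaked FD), closes the original write end, and
// then reads once from the read end.
ReadResult readAfterWriterCloses(bool leakWriteEnd)
{
  int fds[2];
  if (::pipe(fds) != 0) return ReadResult::Error;

  // Make the read end non-blocking so this sketch never hangs. A
  // blocking reader would simply wait where we report `WouldBlock`.
  int flags = ::fcntl(fds[0], F_GETFL);
  assert(flags != -1);
  ::fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

  // The "leak": a second descriptor for the write end that nobody closes
  // before the reader starts draining the pipe.
  int leaked = leakWriteEnd ? ::dup(fds[1]) : -1;

  // The owner closes its copy of the write end.
  ::close(fds[1]);

  ReadResult result = readOnce(fds[0]);

  // Clean up so that the sketch itself does not leak descriptors.
  ::close(fds[0]);
  if (leaked >= 0) ::close(leaked);

  return result;
}
```

With `leakWriteEnd` set to `false` the read returns `ReadResult::Eof`, so a reader draining the pipe can finish. With `leakWriteEnd` set to `true` the read returns `ReadResult::WouldBlock`: there is no data and no EOF, which is the condition under which a reader that waits for EOF never completes.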

Issue Links

duplicates

MESOS-6632 ContainerLogger might leak FD if container launch fails.

Resolved

is related to

MESOS-7121 Make IO Switchboard optional for debug containers

Open

People

Assignee: Andrei Budnik

Reporter: Meng Zhu

Shepherd: Gilbert Song

Votes: 0

Watchers: 4

Dates

Created: 28/Dec/18 06:17

Updated: 11/Jan/19 21:06

Resolved: 09/Jan/19 19:39