Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
0.21.0
-
None
-
None
Description
Right now, if isolator cleanup fails, we'll transition all pending tasks to TASK_LOST (even in the OOM case, we should have transitioned it to TASK_FAILED).
The problematic code is here:
1052 void MesosContainerizerProcess::___destroy( 1053 const ContainerID& containerId, 1054 const Future<Option<int>>& status, 1055 const Future<list<Future<Nothing>>>& cleanups) 1056 { 1057 // This should not occur because we only use the Future<list> to 1058 // facilitate chaining. 1059 CHECK_READY(cleanups); 1060 1061 // Check cleanup succeeded for all isolators. If not, we'll fail the 1062 // container termination and remove the 'destroying' flag but leave 1063 // all other state. The container is now in an inconsistent state. 1064 foreach (const Future<Nothing>& cleanup, cleanups.get()) { 1065 if (!cleanup.isReady()) { 1066 promises[containerId]->fail( 1067 "Failed to clean up an isolator when destroying container '" + 1068 stringify(containerId) + "' :" + 1069 (cleanup.isFailed() ? cleanup.failure() : "discarded future")); 1070 1071 destroying.erase(containerId); 1072 1073 return; 1074 } 1075 }
Since launcher->destroy already succeeds (all processes are killed), instead of failing the promises[containerId], we probably should just export the error through metrics (so that people can get alerted on that) and still set the termination appropriately.