  Mesos / MESOS-2047

Isolator cleanup failures shouldn't cause TASK_LOST.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.1
    • Component/s: None
    • Labels: None

      Description

      Right now, if an isolator cleanup fails, we transition all pending tasks to TASK_LOST (even in the OOM case, where we should have transitioned them to TASK_FAILED).

      The problematic code is here:

      void MesosContainerizerProcess::___destroy(
          const ContainerID& containerId,
          const Future<Option<int>>& status,
          const Future<list<Future<Nothing>>>& cleanups)
      {
        // This should not occur because we only use the Future<list> to
        // facilitate chaining.
        CHECK_READY(cleanups);

        // Check cleanup succeeded for all isolators. If not, we'll fail the
        // container termination and remove the 'destroying' flag but leave
        // all other state. The container is now in an inconsistent state.
        foreach (const Future<Nothing>& cleanup, cleanups.get()) {
          if (!cleanup.isReady()) {
            promises[containerId]->fail(
              "Failed to clean up an isolator when destroying container '" +
              stringify(containerId) + "' :" +
              (cleanup.isFailed() ? cleanup.failure() : "discarded future"));

            destroying.erase(containerId);

            return;
          }
        }
      

      Since launcher->destroy has already succeeded (all processes are killed), instead of failing promises[containerId] we should probably just export the error through metrics (so that people can be alerted) and still set the termination appropriately.
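      As a rough sketch (not the actual fix), the loop above could log and count the failure via a hypothetical 'container_destroy_errors' counter instead of failing promises[containerId], and then fall through to set the termination from 'status' as usual:

      // Hypothetical sketch only: record isolator cleanup failures instead of
      // failing the container termination.
      foreach (const Future<Nothing>& cleanup, cleanups.get()) {
        if (!cleanup.isReady()) {
          LOG(ERROR) << "Failed to clean up an isolator when destroying "
                     << "container '" << containerId << "': "
                     << (cleanup.isFailed() ? cleanup.failure()
                                            : "discarded future");

          // Hypothetical counter, exported so operators can alert on it.
          ++metrics.container_destroy_errors;
        }
      }

      // Continue to set the termination based on 'status' so the task is
      // reported as TASK_FAILED/TASK_FINISHED rather than TASK_LOST.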

        Activity

        bmahler Benjamin Mahler added a comment -

        For the short term, should we change the following known race to not be an error?

        port_mapping.cpp
          Try<Nothing> unmount = fs::unmount(target, MNT_DETACH);
          if (unmount.isError()) {
            errors.push_back("Failed to umount: " + unmount.error());
          }
        
          // MNT_DETACH does a lazy umount, which means umount will eventually
          // succeed when the mount point becomes idle, but possibly not
          // soon enough every time for this remove to go through, e.g.,
          // someone entered into the container for debugging purposes. In that
          // case remove will fail, which is okay, because we only leaked an
          // empty file, which could also be reused later if the pid (the name
          // of the file) is used again. However, we still return error to
          // indicate that the cleanup hasn't been successful.
          Try<Nothing> rm = os::rm(target);
          if (rm.isError()) {
            errors.push_back("Failed to remove " + target + ": " + rm.error());
          }
        

        Seems like this isn't something that should surface a TASK_LOST.
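
        One possible shape of that change, sketched against the same port_mapping.cpp cleanup code quoted above: keep treating the unmount failure as an error, but only log the os::rm failure rather than appending it to 'errors':

          // Sketch only: the lazy (MNT_DETACH) umount may not have completed
          // yet, so a failed remove only leaks an empty file; log it instead
          // of turning it into a cleanup error.
          Try<Nothing> rm = os::rm(target);
          if (rm.isError()) {
            LOG(WARNING) << "Failed to remove " << target
                         << ": " << rm.error();
          }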

        bmahler Benjamin Mahler added a comment -

        Ian Downes Looks like we have the same code in the pid isolator, but it's not an error in this case...?

        Future<Nothing> NamespacesPidIsolatorProcess::cleanup(
            const ContainerID& containerId)
        {
          const string target = nsExtraReference(containerId);
        
          if (os::exists(target)) {
            // We don't expect anyone to have a reference to target but do a
            // lazy umount in case. We do not want to force the umount; it
            // will not cause an issue if this umount is delayed.
            Try<Nothing> unmount = fs::unmount(target, MNT_DETACH);
        
            // This will fail if the unmount hasn't completed yet but this
            // only leaks a uniquely named empty file that will be cleaned up as
            // an orphan on recovery.
            os::rm(target);
          }
        
          return Nothing();
        }
        
        idownes Ian Downes added a comment -

        Yes, it doesn't make sense to do a lazy unmount and then fail if you cannot immediately remove the mount target.

        The pid isolator code just leaves the empty file around if it cannot immediately remove it, i.e., it's not considered a cleanup failure. I suggest the port_mapping code act in the same manner.

        jieyu Jie Yu added a comment -

        Ian Downes Do you agree with the points I made in the description? Yes, we can get around this particular issue by removing the error check for os::rm. However, we never know whether there will be other errors in the future (from other isolators) that we can safely ignore.

        idownes Ian Downes added a comment -

        I do agree that it's incorrect to return TASK_LOST if all processes have been terminated but we failed to clean up other state, i.e., from the framework's perspective the task completed (whether it be FAILED or whatever).

        The short-term fix is to avoid failing on the os::rm, but longer term we should do as you suggest, perhaps even partitioning into ::terminate and ::cleanup so the slave has visibility into the stages.
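
        A hedged sketch of that longer-term split (hypothetical interface, not the actual Mesos Isolator API): a 'terminate' phase whose failure would legitimately fail the destroy, and a best-effort 'cleanup' phase whose errors would only be logged/metered:

          class Isolator
          {
          public:
            virtual ~Isolator() {}

            // Must succeed before the container termination is reported;
            // failures here can still fail the destroy.
            virtual process::Future<Nothing> terminate(
                const ContainerID& containerId) = 0;

            // Best-effort resource cleanup; failures here should be exported
            // through metrics rather than surfacing as TASK_LOST.
            virtual process::Future<Nothing> cleanup(
                const ContainerID& containerId) = 0;
          };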

        jieyu Jie Yu added a comment -

        Short-term fix: https://reviews.apache.org/r/28339

        jieyu Jie Yu added a comment -

        commit 5d3093a39970a5cd0b907de13d96c8dc046ce3e3
        Author: Jie Yu <yujie.jay@gmail.com>
        Date: Fri Nov 21 10:35:23 2014 -0800

        Do not return error if removing bind mount fails.

        Review: https://reviews.apache.org/r/28339

        jieyu Jie Yu added a comment -

        Closing this for now, as the short-term fix has been pushed.


          People

          • Assignee:
            Unassigned
          • Reporter:
            jieyu Jie Yu
          • Votes:
            0
          • Watchers:
            3
