Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-2047

Isolator cleanup failures shouldn't cause TASK_LOST.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.1
    • Component/s: None
    • Labels:
      None

      Description

      Right now, if isolator cleanup fails, we'll transition all pending tasks to TASK_LOST (even in the OOM case, we should have transitioned it to TASK_FAILED).

      The problematic code is here:

      1052 void MesosContainerizerProcess::___destroy(                                                         
      1053     const ContainerID& containerId,
      1054     const Future<Option<int>>& status,
      1055     const Future<list<Future<Nothing>>>& cleanups)
      1056 {   
      1057   // This should not occur because we only use the Future<list> to                                  
      1058   // facilitate chaining.
      1059   CHECK_READY(cleanups);
      1060                  
      1061   // Check cleanup succeeded for all isolators. If not, we'll fail the                              
      1062   // container termination and remove the 'destroying' flag but leave                               
      1063   // all other state. The container is now in an inconsistent state.
      1064   foreach (const Future<Nothing>& cleanup, cleanups.get()) {                                        
      1065     if (!cleanup.isReady()) {                                                                       
      1066       promises[containerId]->fail(                                                                  
      1067         "Failed to clean up an isolator when destroying container '" +                              
      1068         stringify(containerId) + "' :" +
      1069         (cleanup.isFailed() ? cleanup.failure() : "discarded future"));                             
      1070   
      1071       destroying.erase(containerId);
      1072                  
      1073       return;    
      1074     }
      1075   }
      

      Since launcher->destroy already succeeds (all processes are killed), instead of failing the promises[containerId], we probably should just export the error through metrics (so that people can get alerted on that) and still set the termination appropriately.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              jieyu Jie Yu
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: