Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-2047

Isolator cleanup failures shouldn't cause TASK_LOST.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.21.0
    • 0.21.1
    • None
    • None

    Description

      Right now, if isolator cleanup fails, we'll transition all pending tasks to TASK_LOST (even in the OOM case, we should have transitioned it to TASK_FAILED).

      The problematic code is here:

      1052 void MesosContainerizerProcess::___destroy(                                                         
      1053     const ContainerID& containerId,
      1054     const Future<Option<int>>& status,
      1055     const Future<list<Future<Nothing>>>& cleanups)
      1056 {   
      1057   // This should not occur because we only use the Future<list> to                                  
      1058   // facilitate chaining.
      1059   CHECK_READY(cleanups);
      1060                  
      1061   // Check cleanup succeeded for all isolators. If not, we'll fail the                              
      1062   // container termination and remove the 'destroying' flag but leave                               
      1063   // all other state. The container is now in an inconsistent state.
      1064   foreach (const Future<Nothing>& cleanup, cleanups.get()) {                                        
      1065     if (!cleanup.isReady()) {                                                                       
      1066       promises[containerId]->fail(                                                                  
      1067         "Failed to clean up an isolator when destroying container '" +                              
      1068         stringify(containerId) + "' :" +
      1069         (cleanup.isFailed() ? cleanup.failure() : "discarded future"));                             
      1070   
      1071       destroying.erase(containerId);
      1072                  
      1073       return;    
      1074     }
      1075   }
      

      Since launcher->destroy already succeeds (all processes are killed), instead of failing the promises[containerId], we probably should just export the error through metrics (so that people can get alerted on that) and still set the termination appropriately.

      Attachments

        Activity

          People

            Unassigned Unassigned
            jieyu Jie Yu
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: