Mesos / MESOS-7152

The agent may flap after the machine reboots due to provisioner recovery.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.2, 1.2.0
    • Component/s: None

      Description

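      After the agent machine reboots, if the agent work dir (e.g., /var/lib/mesos) survives but the container runtime directory is gone (so the recovered SlaveState is empty as well), the provisioner's recover() aborts the agent on a CHECK failure. As the stack trace below shows, recover() calls destroy() on containers left in the provisioner directory, and in this case it can reach a parent before its nested container, which breaks the invariant that a child container is always cleaned up before its parent container.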

      This is a corner case that only happens if the machine reboots and the provisioner directory survives.

      F0217 01:10:18.423238 30099 provisioner.cpp:504] Check failed: entry.parent() != containerId Failed to destroy container 1 since its nested container 1.2 has not been destroyed yet
      *** Check failure stack trace: ***
          @     0x7fceb444121d  google::LogMessage::Fail()
          @     0x7fceb44405ee  google::LogMessage::SendToLog()
          @     0x7fceb4440eed  google::LogMessage::Flush()
          @     0x7fceb4444368  google::LogMessageFatal::~LogMessageFatal()
          @     0x7fceb36137f9  mesos::internal::slave::ProvisionerProcess::destroy()
          @     0x7fceb36126f0  mesos::internal::slave::ProvisionerProcess::recover()
          @     0x7fceb3637fc6  _ZZN7process8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS2_11ContainerIDESt4hashIS7_ESt8equal_toIS7_EESC_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESS_
          @     0x7fceb3637bc2  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS6_11ContainerIDESt4hashISB_ESt8equal_toISB_EESG_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSN_FSL_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
          @     0x7fceb43848e4  std::function<>::operator()()
          @     0x7fceb436baf4  process::ProcessBase::visit()
          @     0x7fceb43e5fde  process::DispatchEvent::visit()
          @           0x9e4101  process::ProcessBase::serve()
          @     0x7fceb4369007  process::ProcessManager::resume()
          @     0x7fceb4377a8c  process::ProcessManager::init_threads()::$_2::operator()()
          @     0x7fceb4377995  _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
          @     0x7fceb4377965  std::_Bind_simple<>::operator()()
          @     0x7fceb437793c  std::thread::_Impl<>::_M_run()
          @     0x7fceadefa030  (unknown)
          @     0x7fcead70b6aa  start_thread
          @     0x7fcead440e9d  (unknown)
      

      The provisioner directory is supposed to live under the container runtime directory, so that the two are lost together. However, moving it is not backward compatible, so we can only change it after a deprecation cycle.

      For now, we have three options:
      1. Make provisioner::destroy() recursive.
      2. Sort the containers during recovery to guarantee the `child before parent` semantics.
      3. Remove the check failure, since the whole provisioner dir will be removed eventually at the end (not recommended).
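
      To illustrate option (2), here is a minimal, self-contained sketch of ordering containers so that nested containers always come before their ancestors. It does not use the Mesos ContainerID protobuf: `ContainerPath`, `depth`, `isNestedUnder` and `childrenFirst` are hypothetical names, and container IDs are modeled as dotted strings such as "1.2", following the notation in the log message above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a container ID: a dotted path such as "1.2",
// where "1.2" is nested under "1". This is NOT the Mesos ContainerID type.
using ContainerPath = std::string;

// Nesting depth: "1" -> 0, "1.2" -> 1, "1.2.3" -> 2.
int depth(const ContainerPath& id) {
  return static_cast<int>(std::count(id.begin(), id.end(), '.'));
}

// True if `child` is nested (at any level) under `parent`.
bool isNestedUnder(const ContainerPath& child, const ContainerPath& parent) {
  return child.size() > parent.size() &&
         child.compare(0, parent.size(), parent) == 0 &&
         child[parent.size()] == '.';
}

// Returns the containers ordered deepest first, so every nested container
// precedes its ancestors. Containers of equal depth keep their input order.
std::vector<ContainerPath> childrenFirst(std::vector<ContainerPath> ids) {
  std::stable_sort(
      ids.begin(), ids.end(),
      [](const ContainerPath& a, const ContainerPath& b) {
        return depth(a) > depth(b);
      });

  // Postcondition: no container appears before one of its descendants.
  for (std::size_t i = 0; i < ids.size(); ++i) {
    for (std::size_t j = i + 1; j < ids.size(); ++j) {
      assert(!isNestedUnder(ids[j], ids[i]));
    }
  }

  return ids;
}
```

      Destroying containers in the order returned by `childrenFirst` never reaches a parent while one of its nested containers is still present, which is the condition the failed check enforces.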

      Recommendation: option (1).
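
      As a rough sketch of what a recursive destroy could look like, the toy class below destroys all nested containers before the container itself, so the order in which recovery visits containers no longer matters. This is an illustration under assumptions, not the actual fix: `ToyProvisioner`, `ContainerPath` and `isDirectChild` are hypothetical, IDs are modeled as dotted strings such as "1.2", and the real ProvisionerProcess::destroy() is asynchronous and also removes on-disk state.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a container ID ("1.2" is nested under "1").
using ContainerPath = std::string;

// True if `child` is exactly one nesting level below `parent`.
bool isDirectChild(const ContainerPath& child, const ContainerPath& parent) {
  return child.size() > parent.size() + 1 &&
         child.compare(0, parent.size(), parent) == 0 &&
         child[parent.size()] == '.' &&
         child.find('.', parent.size() + 1) == std::string::npos;
}

// Toy model of a provisioner that only tracks which containers it knows.
class ToyProvisioner {
public:
  explicit ToyProvisioner(std::set<ContainerPath> known)
    : known_(std::move(known)) {}

  // Destroys every nested container first, then the container itself.
  void destroy(const ContainerPath& id) {
    if (known_.count(id) == 0) {
      return;  // Unknown or already destroyed.
    }

    // Collect children up front: the recursive calls mutate `known_`.
    std::vector<ContainerPath> children;
    for (const ContainerPath& entry : known_) {
      if (isDirectChild(entry, id)) {
        children.push_back(entry);
      }
    }
    for (const ContainerPath& child : children) {
      destroy(child);
    }

    // The invariant behind the failed check: no nested container is left.
    for (const ContainerPath& entry : known_) {
      assert(!isDirectChild(entry, id));
    }

    known_.erase(id);
    destroyed_.push_back(id);
  }

  // Containers in the order they were destroyed.
  const std::vector<ContainerPath>& destroyed() const { return destroyed_; }

  bool empty() const { return known_.empty(); }

private:
  std::set<ContainerPath> known_;
  std::vector<ContainerPath> destroyed_;
};
```

      With containers "1", "1.2" and "1.2.3" known, calling destroy("1") in this model removes "1.2.3", then "1.2", then "1", regardless of which container recovery happens to visit first.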

            People

            • Assignee:
              Gilbert Song (gilbert)
              Reporter:
              Gilbert Song (gilbert)
              Shepherd:
              Jie Yu
            • Votes:
              0
              Watchers:
              4
