  1. Mesos
  2. MESOS-7152

The agent may flap after the machine reboots due to provisioner recovery.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • None
    • 1.1.2, 1.2.0
    • None

    Description

      After the agent machine reboots, if the agent work dir survives (e.g., /var/lib/mesos) but the container runtime directory is gone (so the recovered SlaveState is empty as well), the provisioner's recover() crashes the agent with the check failure shown below. This case breaks the invariant that a child container is always cleaned up before its parent container.

      This is a particular case which only happens if the machine reboots and the provisioner directory survives.

      F0217 01:10:18.423238 30099 provisioner.cpp:504] Check failed: entry.parent() != containerId Failed to destroy container 1 since its nested container 1.2 has not been destroyed yet
      *** Check failure stack trace: ***
          @     0x7fceb444121d  google::LogMessage::Fail()
          @     0x7fceb44405ee  google::LogMessage::SendToLog()
          @     0x7fceb4440eed  google::LogMessage::Flush()
          @     0x7fceb4444368  google::LogMessageFatal::~LogMessageFatal()
          @     0x7fceb36137f9  mesos::internal::slave::ProvisionerProcess::destroy()
          @     0x7fceb36126f0  mesos::internal::slave::ProvisionerProcess::recover()
          @     0x7fceb3637fc6  _ZZN7process8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS2_11ContainerIDESt4hashIS7_ESt8equal_toIS7_EESC_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESS_
          @     0x7fceb3637bc2  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS6_11ContainerIDESt4hashISB_ESt8equal_toISB_EESG_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSN_FSL_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
          @     0x7fceb43848e4  std::function<>::operator()()
          @     0x7fceb436baf4  process::ProcessBase::visit()
          @     0x7fceb43e5fde  process::DispatchEvent::visit()
          @           0x9e4101  process::ProcessBase::serve()
          @     0x7fceb4369007  process::ProcessManager::resume()
          @     0x7fceb4377a8c  process::ProcessManager::init_threads()::$_2::operator()()
          @     0x7fceb4377995  _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
          @     0x7fceb4377965  std::_Bind_simple<>::operator()()
          @     0x7fceb437793c  std::thread::_Impl<>::_M_run()
          @     0x7fceadefa030  (unknown)
          @     0x7fcead70b6aa  start_thread
          @     0x7fcead440e9d  (unknown)
      

      The provisioner directory should live under the container runtime directory. However, moving it there is not backward compatible, so we can only change it after a deprecation cycle.

      For now, we have three options:
      1. Make provisioner::destroy() recursive.
      2. Sort the containers during recovery to guarantee the `child before parent` semantics.
      3. Remove the check failure, since the whole provisioner dir will be removed eventually at the end (not recommended).

      Recommend (1).
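
Options (1) and (2) can be illustrated with a small standalone sketch. This is not the Mesos code: it assumes a simplified model in which a container ID is a dotted string (so "1.2" is nested under "1") instead of the real ContainerID with its parent() field, and the names Id, parentOf, destroyRecursive and childBeforeParent are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a container ID: "1.2" is nested under "1".
using Id = std::string;

// Returns the parent ID, or an empty string for a top-level container.
Id parentOf(const Id& id) {
  const std::size_t dot = id.rfind('.');
  return dot == std::string::npos ? Id() : id.substr(0, dot);
}

// Option (1): destroy all nested containers (recursively) before the
// container itself, so the `child before parent` invariant always holds.
// `order` records the sequence in which containers were destroyed.
void destroyRecursive(
    const Id& id, std::set<Id>& known, std::vector<Id>& order) {
  if (known.count(id) == 0) {
    return;  // Already destroyed, or never known.
  }

  // Collect direct children first to avoid erasing while iterating.
  std::vector<Id> children;
  for (const Id& entry : known) {
    if (parentOf(entry) == id) {
      children.push_back(entry);
    }
  }

  for (const Id& child : children) {
    destroyRecursive(child, known, order);
  }

  known.erase(id);
  order.push_back(id);
}

// Option (2): sort by nesting depth, deepest first, so a plain loop over
// the result never visits a parent before one of its descendants.
std::vector<Id> childBeforeParent(std::vector<Id> ids) {
  auto depth = [](const Id& id) {
    return std::count(id.begin(), id.end(), '.');
  };
  std::stable_sort(
      ids.begin(), ids.end(),
      [&](const Id& a, const Id& b) { return depth(a) > depth(b); });
  return ids;
}
```

With the containers from the log above, calling destroyRecursive on "1" destroys "1.2" first and then "1", whatever order the recovered set happens to iterate in. Option (2) reaches the same ordering by sorting up front, but every caller of destroy() would have to remember to sort, which is one reason to prefer (1).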

          People

            Gilbert Song
            Gilbert Song
            Jie Yu