Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Description
We were testing an upgrade scenario recently and encountered the following assertion failure:
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.693977 20810 http.cpp:3116] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'a89b211a-4549-462d-9cc7-0ea2bac2f729.1c262420-7525-4fee-99c1-aff4f66996bd.check-a41362ae-13c6-4750-990e-a1a0b2792b5f'
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.695179 20807 containerizer.cpp:1169] Trying to chown '/var/lib/mesos/slave/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S12/frameworks/dcf5f8b5-86a8-44df-ac03-b39404239ad8-0377/executors/kafka__68baefd4-aa8c-4b97-a23e-eb6a73fa91f6/runs/a89b211a-4549-462d-9cc7-0ea2bac2f729/containers/1c262420-7525-4fee-99c1-aff4f66996bd/containers/check-a41362ae-13c6-4750-990e-a1a0b2792b5f' to user 'nobody'
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: W1212 16:45:42.695309 20807 containerizer.cpp:1198] Cannot determine executor_info for root container 'a89b211a-4549-462d-9cc7-0ea2bac2f729' which has no config recovered.
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.695327 20807 containerizer.cpp:1203] Starting container a89b211a-4549-462d-9cc7-0ea2bac2f729.1c262420-7525-4fee-99c1-aff4f66996bd.check-a41362ae-13c6-4750-990e-a1a0b2792b5f
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.695829 20807 containerizer.cpp:2932] Transitioning the state of container a89b211a-4549-462d-9cc7-0ea2bac2f729.1c262420-7525-4fee-99c1-aff4f66996bd.check-a41362ae-13c6-4750-990e-a1a0b2792b5f from PROVISIONING to PREPARING
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.700569 20811 systemd.cpp:98] Assigned child process '20941' to 'mesos_executors.slice'
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.702945 20811 systemd.cpp:98] Assigned child process '20942' to 'mesos_executors.slice'
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.706069 20806 switchboard.cpp:575] Created I/O switchboard server (pid: 20943) listening on socket file '/tmp/mesos-io-switchboard-74af71bb-2385-4dde-9762-94d0196124d3' for container a89b211a-4549-462d-9cc7-0ea2bac2f729.1c262420-7525-4fee-99c1-aff4f66996bd.check-a41362ae-13c6-4750-990e-a1a0b2792b5f
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: mesos-agent: /pkg/src/mesos/3rdparty/stout/include/stout/option.hpp:115: T& Option<T>::get() & [with T = mesos::slave::ContainerConfig]: Assertion `isSome()' failed.
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: *** Aborted at 1513097142 (unix time) try "date -d @1513097142" if you are using GNU date ***
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: PC: @ 0x7f472f2851f7 __GI_raise
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: *** SIGABRT (@0x5134) received by PID 20788 (TID 0x7f472a2bf700) from PID 20788; stack trace: ***
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f472f6225e0 (unknown)
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f472f2851f7 __GI_raise
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f472f2868e8 __GI_abort
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f472f27e266 __assert_fail_base
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f472f27e312 __GI___assert_fail
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f4731c481e3 _ZNR6OptionIN5mesos5slave15ContainerConfigEE3getEv.part.170
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f4731c61c2d mesos::internal::slave::MesosContainerizerProcess::_launch()
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f4731c7f403 _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave13Containerizer12LaunchResultENS5_25MesosContainerizerProcessERKNS3_11ContainerIDERK6OptionINS3_5slave11ContainerIOEERKSt3mapISsSsSt4lessISsESaISt4pairIKSsSsEEERKSC_ISsESB_SH_SR_SU_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSX_T1_T2_T3_T4_EOT5_OT6_OT7_OT8_EUlSt10unique_ptrINS1_7PromiseIS7_EESt14default_deleteIS1J_EEOS9_OSF_OSP_OSS_PNS1_11ProcessBaseEE_IS1M_S9_SF_SP_SS_S1S_EEEDTclcl7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1U_
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f4731c7f4f1 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave13Containerizer12LaunchResultENSC_25MesosContainerizerProcessERKNSA_11ContainerIDERK6OptionINSA_5slave11ContainerIOEERKSt3mapISsSsSt4lessISsESaISt4pairIKSsSsEEERKSJ_ISsESI_SO_SY_S11_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMS16_FS14_T1_T2_T3_T4_EOT5_OT6_OT7_OT8_EUlSt10unique_ptrINS1_7PromiseISE_EESt14default_deleteIS1Q_EEOSG_OSM_OSW_OSZ_S3_E_IS1T_SG_SM_SW_SZ_St12_PlaceholderILi1EEEEEEclEOS3_
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f47325dbb31 process::ProcessBase::consume()
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f47325ea882 process::ProcessManager::resume()
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f47325efcf6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f472fafa230 (unknown)
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f472f61ae25 start_thread
Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @ 0x7f472f34834d __clone
Dec 12 16:45:42 agent.hostname systemd[1]: dcos-mesos-slave.service: main process exited, code=killed, status=6/ABRT
Dec 12 16:45:42 agent.hostname systemd[1]: Unit dcos-mesos-slave.service entered failed state.
Dec 12 16:45:42 agent.hostname systemd[1]: dcos-mesos-slave.service failed.
Looking into MesosContainerizerProcess::_launch (the frame that calls Option<T>::get() in the stack trace above), we indeed find an unguarded access to the parent container's ContainerConfig.
We recently added checkpointing of ContainerConfig to the Mesos containerizer. It seems that we are not handling upgrades appropriately: containers launched before the upgrade may still be running, and for those we should not expect to recover a ContainerConfig.
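To illustrate the failure mode, here is a minimal, self-contained sketch of a guarded access. It is not the actual Mesos code or the committed fix: std::optional stands in for stout's Option<T>, and the ContainerConfig, Container, and inheritedUser names below are hypothetical simplifications. The point is only that a parent container recovered without a checkpointed config must be checked before its config is dereferenced.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for mesos::slave::ContainerConfig.
struct ContainerConfig {
  std::string user;
};

// Hypothetical stand-in for the containerizer's per-container state.
struct Container {
  // Empty for containers launched before ContainerConfig checkpointing
  // existed, since nothing could be recovered for them after an upgrade.
  std::optional<ContainerConfig> config;
};

// Returns the user a nested container would inherit from its parent,
// or std::nullopt if the parent is unknown or has no recovered config.
std::optional<std::string> inheritedUser(
    const std::map<std::string, Container>& containers,
    const std::string& parentId)
{
  auto it = containers.find(parentId);
  if (it == containers.end()) {
    return std::nullopt;
  }

  const std::optional<ContainerConfig>& config = it->second.config;

  // The guard: never dereference the config without checking it first.
  if (!config.has_value()) {
    return std::nullopt;
  }

  return config->user;
}
```

With this shape, a caller handling a nested-container launch under a pre-upgrade parent gets an empty result and can fall back to a default or fail that one launch, rather than aborting the whole agent.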