Description
Hello,
We want to report an issue we observed when the master removes tasks from an agent (slave). Before the tasks can be removed, a check requires that the owning framework is valid. A framework can be disconnected for several reasons; when it is not found, this check fails and crashes the Mesos master node.
https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842
There is also unguarded access to the internal framework state on line 11853.
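To illustrate the kind of guard we have in mind, here is a minimal, self-contained sketch. It does not use the real Mesos types: `Framework`, `Master`, `getFramework`, and `removeAgentTasks` are hypothetical stand-ins, and skipping an unknown framework is only an assumption about acceptable behavior, not necessarily what the actual fix does.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's per-framework state.
struct Framework {
  std::vector<std::string> removedTasks;
};

// Hypothetical stand-in for the master; not the real Mesos class.
struct Master {
  // Frameworks currently known to the master, keyed by framework ID.
  std::map<std::string, Framework> frameworks;

  // Returns nullptr when the framework is not known.
  Framework* getFramework(const std::string& id) {
    auto it = frameworks.find(id);
    return it == frameworks.end() ? nullptr : &it->second;
  }

  // Removes an agent's tasks, keyed by framework ID. An unknown framework
  // is skipped instead of triggering a fatal check. Returns the number of
  // frameworks that were skipped.
  int removeAgentTasks(
      const std::map<std::string, std::vector<std::string>>& agentTasks) {
    int skipped = 0;
    for (const auto& entry : agentTasks) {
      Framework* framework = getFramework(entry.first);
      if (framework == nullptr) {
        // A fatal check at this point would abort the whole process.
        ++skipped;
        continue;
      }
      // Framework state is only touched after the null test above.
      for (const std::string& task : entry.second) {
        framework->removedTasks.push_back(task);
      }
    }
    return skipped;
  }
};
```

In this sketch, an agent whose task map still references a framework the master no longer knows (as in the log below, where the entry for the missing framework is empty) is handled without aborting.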
Error logs:
mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health check timed out
mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check failed: framework != nullptr Framework 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } }
mesos-master[5483]: *** Check failure stack trace: ***
mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica received learned notice for position 42070 from log-network(1)@10.160.73.212:5050
mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
mesos-master[5483]: @ 0x7f2fdf6a8859 google::LogMessageFatal::~LogMessageFatal()
mesos-master[5483]: @ 0x7f2fde2677f2 mesos::internal::master::Master::__removeSlave()
mesos-master[5483]: @ 0x7f2fde267ebe mesos::internal::master::Master::_markUnreachable()
mesos-master[5483]: @ 0x7f2fde268215 _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbEEEEclEv
mesos-master[5483]: @ 0x7f2fddf30688 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
mesos-master[5483]: @ 0x7f2fdf60cb36 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
systemd[1]: Unit mesos-master.service entered failed state.
systemd[1]: mesos-master.service failed.
systemd[1]: mesos-master.service holdoff time over, scheduling restart.
systemd[1]: Stopped Mesos Master.
systemd[1]: Started Mesos Master.
mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level logging started!
mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: 2020-05-09 10:42:00 by centos
mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e
Issue Links
- is fixed by MESOS-9609 Master check failure when marking agent unreachable. (Resolved)