Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
1.5.0
-
None
Description
Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815433 13 http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133 Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815588 13 master.cpp:5467] Processing DECLINE call for offers: [ 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 5e57f633-a69c-4009-b7 Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815693 13 master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820142 10 master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820367 10 registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the registry Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820572 10 registrar.cpp:552] Successfully updated the registry in 175872ns Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820642 11 master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957 9 hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.851961 11 master.cpp:10018] Check failed: 'framework' Must be non NULL Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: *** Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d google::LogMessage::Fail() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 google::LogMessage::SendToLog() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 google::LogMessage::Flush() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 google::LogMessageFatal::~LogMessageFatal() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 google::CheckNotNull<>() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 mesos::internal::master::Master::__removeSlave() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 mesos::internal::master::Master::_markUnreachable() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 process::ProcessBase::consume() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a process::ProcessManager::resume() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown) Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d (unknown) Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) try "date -d @1520762676" if you are using GNU date *** Mar 11 10:04:36 research docker[4503]: PC: @ 0x7f96c2a4d196 (unknown) Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 (TID 0x7f96b986d700) from PID 0; stack trace: *** Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2df1390 (unknown) Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2a4d196 (unknown) Mar 11 10:04:36 research docker[4503]: @ 0x7f96c604ce2c google::DumpStackTraceAndExit() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044a7d google::LogMessage::Fail() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6046830 google::LogMessage::SendToLog() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6044663 google::LogMessage::Flush() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c6047259 google::LogMessageFatal::~LogMessageFatal() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5258e14 google::CheckNotNull<>() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521dfc8 mesos::internal::master::Master::__removeSlave() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c521f1a2 mesos::internal::master::Master::_markUnreachable() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5f98f11 process::ProcessBase::consume() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb2a4a process::ProcessManager::resume() Mar 11 10:04:36 research docker[4503]: @ 0x7f96c5fb65d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv Mar 11 10:04:36 research docker[4503]: @ 0x7f96c35d4c80 (unknown) Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2de76ba start_thread Mar 11 10:04:36 research docker[4503]: @ 0x7f96c2b1d41d (unknown) Mar 11 10:04:38 research systemd[1]: mesos-master2.service: main process exited, code=exited, status=139/n/a Mar 11 10:04:38 research docker[18886]: mesos-master Mar 11 10:04:38 research systemd[1]: Unit mesos-master2.service entered failed state.
Additional case:
I0715 02:56:40.071446 13 master.cpp:1295] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) disconnected I0715 02:56:40.071503 13 master.cpp:3333] Disconnecting agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) I0715 02:56:40.071527 13 master.cpp:3352] Deactivating agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) I0715 02:56:40.071563 13 master.cpp:1319] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from disconnected agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) because the framework is not checkpointing I0715 02:56:40.071579 13 master.cpp:11006] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) I0715 02:56:40.071583 12 hierarchical.cpp:829] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 deactivated I0715 02:56:40.071619 13 master.cpp:11766] Removing executor 'toil-41' with resources {} of framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 on agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) I0715 02:58:08.642220 12 master.cpp:9130] Marking agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out I0715 02:58:08.642675 11 registrar.cpp:487] Applied 1 operations in 305592ns; attempting to update the registry I0715 02:58:08.642922 13 registrar.cpp:544] Successfully updated the registry in 187904ns I0715 02:58:08.643081 17 master.cpp:9173] Marked agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out F0715 02:58:08.643210 17 master.cpp:11402] Check failed: 'framework' Must be non NULL *** Check failure stack trace: *** I0715 02:58:08.643254 12 hierarchical.cpp:680] Removed agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 @ 0x7ffbcffd090d google::LogMessage::Fail() @ 0x7ffbcffd2748 google::LogMessage::SendToLog() @ 0x7ffbcffd04f3 google::LogMessage::Flush() @ 0x7ffbcffd31d9 google::LogMessageFatal::~LogMessageFatal() @ 0x7ffbcec65024 google::CheckNotNull<>() @ 0x7ffbcec32658 mesos::internal::master::Master::__removeSlave() @ 0x7ffbcec33b13 mesos::internal::master::Master::_markUnreachable() @ 0x7ffbcec33e55 _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEUlbE_JbEEEEclEv @ 0x7ffbce93d5d8 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_ @ 0x7ffbcff18371 process::ProcessBase::consume() @ 0x7ffbcff3a97a process::ProcessManager::resume() @ 0x7ffbcff3e6a6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv @ 0x7ffbcc2cd9e0 (unknown) @ 0x7ffbcbde06db start_thread @ 0x7ffbcbb0988f (unknown)
After an analysis of the more recent logs included in the comments below, the following seems to occur:
1) The last of the framework’s tasks is removed:
Oct 27 23:21:18 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:21:18.493418 15 master.cpp:12171] Removing task 2 with resources cpus(allocated: *):1; disk(allocated: *):4024; mem(allocated: *):2048 of framework 522424c1-2fac-42ab-9a70-b424266218a9-0000 on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144)
which means the framework’s entry in slave->tasks is erased: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L13527-L13529
2) Later, the agent disconnects and since the framework is not checkpointing, it is removed from the Slave struct:
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248260 14 master.cpp:1321] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9-0000 (toil) from disconnected agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144) because the framework is not checkpointing
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248289 14 master.cpp:11436] Removing framework 522424c1-2fac-42ab-9a70-b424266218a9-0000 (toil) from agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144)
Oct 27 23:23:20 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:20.248311 14 master.cpp:12211] Removing executor 'toil-440' with resources {} of framework 522424c1-2fac-42ab-9a70-b424266218a9-0000 on agent 522424c1-2fac-42ab-9a70-b424266218a9-S0 at slave(1)@10.0.143.144:5051 (10.0.143.144)
We see no logging related to task removal since slave->tasks[framework->id()] was empty this time. however, since we use operator[] to inspect the task map here, we perform an insertion and it has a side effect: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11416
This means that slave->tasks[framework->id()] now exists but has been initialized to an empty map. ruh roh.
3) Very soon after, the framework failover timeout elapses and the framework is removed:
Oct 27 23:23:22 ip-10-0-131-86.ec2.internal docker[1839]: I1027 23:23:22.890070 11 master.cpp:10224] Framework failover timeout, removing framework 522424c1-2fac-42ab-9a70-b424266218a9-0000 (toil)
4) Now when __removeSlave() iterates over the keys of slave->tasks, it finds a key which points to a framework that has already been removed: https://github.com/apache/mesos/blob/e13929d62663015162db7e66c6600fe414d03ec3/src/master/master.cpp#L11796-L11800
We need to prevent that unintended map insertion to avoid the crash.
The resolution of this ticket should also involve the writing of a regression test.
Attachments
Attachments
Issue Links
- fixes
-
MESOS-10146 Removing task from slave when framework is disconnected causes master to crash
-
- Resolved
-
- is related to
-
MESOS-7991 fatal, check failed !framework->recovered()
-
- Resolved
-