Mesos / MESOS-7432

Agent state can become corrupted


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: agent, master
    • Labels: None

    Description

      Under some circumstances the master can crash with the following (Mesos 1.1):

      I0415 02:51:35.302822 21132 master.cpp:8272] Adding task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a with resources … on agent b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371 (agent_host)
      I0415 02:51:35.311982 21132 master.cpp:5426] Re-registered agent b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371 at slave(1)@agent_ip:5051 (agent_host) with cpus(*):24; mem(*):61261; ports(*):[31000-32000]; disk(*):66123; ephemeral_ports(*):[32768-57344]
      I0415 02:51:35.312072 21132 master.cpp:5440] Shutting down framework 201205082337-0000000003-0000 at reregistered agent b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371 at slave(1)@agent_ip:5051 (agent_host) because the framework is not partition-aware
      I0415 02:51:35.320315 21119 hierarchical.cpp:485] Added agent b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371 (agent_host) with cpus(*):24; mem(*):61261; ports(*):[31000-32000]; disk(*):66123; ephemeral_ports(*):[32768-57344] (allocated: {})
      F0415 02:51:35.525313 21132 master.cpp:7729] Check failed: 'slave' Must be non NULL 
      *** Check failure stack trace: ***
          @     0x7f1270cb92bd  (unknown)
          @     0x7f1270cbb104  (unknown)
          @     0x7f1270cb8eac  (unknown)
          @     0x7f1270cbb9f9  (unknown)
          @     0x7f127024aded  (unknown)
          @     0x7f127021ad2f  (unknown)
          @     0x7f1270240724  (unknown)
          @     0x7f1270242be3  (unknown)
          @     0x7f1270c56371  (unknown)
          @     0x7f1270c56677  (unknown)
          @     0x7f1270d8f760  (unknown)
          @     0x7f126f23c83d  start_thread
          @     0x7f126ea2efdd  clone
      I0415 02:51:38.058533 21111 mesos-master.sh:101] Master Exit Status: 134
      

      Decoded stack trace:

      Master::removeTask master/master.cpp:7729
      Master::_reregisterSlave master/master.cpp:5450
      

      This part of the code has changed in 1.2, and tasks for removed agents are no longer added on re-registration; however, the issues that lead to this crash still exist. The crash seems confusing at first, because the master fails to look up an agent that was just added. However, the agent is registered using its SlaveInfo, while the lookup is performed using the slave_id in TaskInfo, indicating corrupted agent registration data.
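
      For reference, here is a minimal standalone sketch of the failing pattern (the types are illustrative stand-ins, not the actual Mesos classes; only the shape of the lookup mirrors master.cpp):

      // Sketch: the master keys registered agents by the id assigned at
      // (re-)registration (the id in the agent's SlaveInfo), while removeTask
      // looks the agent up by the slave_id stored in the task.
      #include <cassert>
      #include <string>
      #include <unordered_map>

      struct Slave {};  // stand-in for the master's per-agent bookkeeping

      int main()
      {
        std::unordered_map<std::string, Slave> registered;

        // Id the agent (re-)registered with, taken from its SlaveInfo.
        registered.emplace("b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371", Slave{});

        // Id still carried by the task that the agent checkpointed under its
        // previous incarnation.
        const std::string taskSlaveId = "b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12";

        auto it = registered.find(taskSlaveId);
        Slave* slave = (it == registered.end()) ? nullptr : &it->second;

        // In master.cpp this is CHECK_NOTNULL(slave), which aborts with
        // "Check failed: 'slave' Must be non NULL".
        assert(slave != nullptr);
        return 0;
      }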

      Below is the sequence of events that led to the inconsistent agent state, which crashed the master when the agent sent it upon re-registration.
      1. Agent b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12 registered with the master, and task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a was assigned to it:

      I0411 23:52:28.516815  1748 slave.cpp:1115] Registered with master master@master_ip:5050; given agent ID b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12
      ...
      I0415 00:37:34.436111  1735 slave.cpp:1539] Got assigned task 'task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a' for framework 201205082337-0000000003-0000
      

      2. The agent host was rebooted and the agent started recovery when restarted. The agent correctly detected the reboot (a sketch of the detection follows the log excerpt below), so it did not recover any data and started registration as a new agent. However, the agent was killed before it received the registration confirmation from the master.

      I0415 01:08:13.300375  1772 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
      I0415 01:08:13.300979  1772 state.cpp:698] No committed checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
      I0415 01:08:13.301540  1772 state.cpp:88] Agent host rebooted
      I0415 01:08:13.301772  1783 status_update_manager.cpp:203] Recovering status update manager
      I0415 01:08:13.302006  1767 containerizer.cpp:555] Recovering containerizer
      I0415 01:08:13.304095  1766 port_mapping.cpp:2302] Network isolator recovery complete
      I0415 01:08:13.306538  1781 provisioner.cpp:253] Provisioner recovery complete
      I0415 01:08:13.306686  1785 slave.cpp:5281] Finished recovery
      I0415 01:08:13.308131  1785 slave.cpp:5314] Garbage collecting old agent b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12
      I0415 01:08:13.308218  1768 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12' for gc 6.99999643315852days in the future
      I0415 01:08:13.308629  1781 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12' for gc 6.99999642825185days in the future
      2017-04-15 01:08:16,587:1512(0x7fb7a28c5700):ZOO_INFO@auth_completion_func@1300: Authentication scheme digest succeeded
      I0415 01:08:16.587770  1771 group.cpp:418] Trying to create path '/home/mesos/prod/master' in ZooKeeper
      I0415 01:08:16.589260  1777 detector.cpp:152] Detected a new leader: (id='316')
      I0415 01:08:16.589419  1788 group.cpp:697] Trying to get '/home/mesos/prod/master/json.info_0000000316' in ZooKeeper
      I0415 01:08:16.590313  1782 zookeeper.cpp:259] A new leading master (UPID=master@master_ip:5050) is detected
      I0415 01:08:16.590425  1773 status_update_manager.cpp:177] Pausing sending status updates
      I0415 01:08:16.590436  1781 slave.cpp:915] New master detected at master@master_ip:5050
      I0415 01:08:16.590461  1781 slave.cpp:936] No credentials provided. Attempting to register without authentication
      I0415 01:08:16.590490  1781 slave.cpp:947] Detecting new master
      I0415 01:08:33.728121  1775 http.cpp:277] HTTP GET for /slave(1)/state.json from 127.0.0.1:60728 with User-Agent='python-requests/2.11.1'
      I0415 01:09:13.298820  1764 slave.cpp:5044] Current disk usage 40.43%. Max allowed age: 3.469583558088658days
      W0415 01:09:13.298820  1786 logging.cpp:91] RAW: Received signal SIGTERM from process 3892 of user 0; exiting
      /usr/local/bin/mesos-slave.sh: line 99:  1512 Terminated              $debug /usr/local/sbin/mesos-slave "${MESOS_FLAGS[@]}"
      Slave Exit Status: 143
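
      How the reboot is detected: during recovery the agent compares the boot id it checkpointed under the meta directory with the current one, and on a mismatch it discards the checkpointed state and registers as a brand-new agent. A minimal standalone sketch of that check (the paths and helper are illustrative, not the exact 1.1 code):

      #include <fstream>
      #include <iostream>
      #include <iterator>
      #include <string>

      // Reads a whole file into a string (stand-in for stout's os::read()).
      static std::string readAll(const std::string& path)
      {
        std::ifstream in(path);
        return std::string(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
      }

      int main()
      {
        const std::string metaDir = "/var/lib/mesos/meta";

        // Boot id written by a previous run of the agent, and the boot id of
        // the currently running kernel.
        const std::string checkpointed = readAll(metaDir + "/boot_id");
        const std::string current = readAll("/proc/sys/kernel/random/boot_id");

        if (!checkpointed.empty() && checkpointed != current) {
          // Matches "Agent host rebooted" in the log above: the old agent id
          // (...-S12) is scheduled for gc and registration starts fresh.
          std::cout << "Agent host rebooted" << std::endl;
        } else {
          // Otherwise the checkpointed SlaveID is recovered and the agent
          // re-registers under it.
          std::cout << "Recovering checkpointed agent state" << std::endl;
        }
        return 0;
      }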
      

      The master was overwhelmed at the moment of registration, so registration took ~1 minute:

      I0415 01:08:35.404637 21122 master.cpp:5113] Registering agent at slave(1)@agent_ip:5051 (agent_host) with id b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371
      ...
      I0415 01:08:56.179731 21134 master.cpp:5101] Ignoring register agent message from slave(1)@agent_ip:5051 (agent_host) as admission is already in progress
      ...
      I0415 01:09:27.317859 21128 master.cpp:5184] Registered agent b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371 at slave(1)@agent_ip:5051 (agent_host) with cpus(*):24; mem(*):61261; ports(*):[31000-32000]; disk(*):66123; ephemeral_ports(*):[32768-57344]
      

      Important points in these logs:

      • bootId is checkpointed immediately after
        I0415 01:08:13.306686  1785 slave.cpp:5281] Finished recovery
        

        (see MESOS-940; the resulting window is sketched after these points)

      • the agent did not receive the registration confirmation, because it was killed at 01:09:13, while the master finished registration only at 01:09:27.
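
      A compact sketch of that window (the function names are illustrative; only the ordering of the two steps matters):

      #include <iostream>

      // Stand-ins for the two steps whose ordering creates the window.
      static void checkpointBootId()   { std::cout << "boot_id checkpointed" << std::endl; }
      static void registerWithMaster() { std::cout << "registration sent, reply pending" << std::endl; }

      int main()
      {
        // "Finished recovery": the current boot id is checkpointed immediately...
        checkpointBootId();

        // ...while the registration round trip is asynchronous and here took
        // about a minute because the master was overwhelmed.
        registerWithMaster();

        // If the agent is killed and restarted inside this window (as happened
        // at 01:09:13), the next run finds a boot_id matching the current boot,
        // so the earlier reboot is NOT detected and the pre-reboot state
        // (slaveId ...-S12) is recovered, while the master goes on to register
        // the new id ...-S3371.
        return 0;
      }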

      3. The agent was started again immediately after it was killed. Since it had checkpointed the bootId during the previous run, it could not detect the reboot, and it recovered the old data (with slaveId=b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12). The agent started re-registration, but it was delayed due to registration_backoff_factor. By that time the master had finished registering the previous agent instance and sent a response with a new agent ID b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371. This made the agent overwrite the slaveId in SlaveInfo, making it inconsistent with the slaveId in TaskInfo (the mismatch is sketched after the log excerpt below).

      I0415 01:09:18.849212  4004 slave.cpp:915] New master detected at master@master_ip:5050
      I0415 01:09:18.849231  4004 slave.cpp:936] No credentials provided. Attempting to register without authentication
      I0415 01:09:18.849247  4004 slave.cpp:947] Detecting new master
      I0415 01:09:27.453253  3995 slave.cpp:1115] Registered with master master@master_ip:5050; given agent ID b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371
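
      A standalone sketch of the resulting inconsistency (the types and field names are illustrative stand-ins, not the actual Mesos protobufs):

      #include <cassert>
      #include <string>
      #include <vector>

      struct SlaveInfo { std::string id; };
      struct Task      { std::string name; std::string slave_id; };

      int main()
      {
        // State recovered from /var/lib/mesos/meta after the second restart.
        SlaveInfo info{"b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12"};
        std::vector<Task> tasks = {
            {"task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a",
             "b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12"}};

        // The registration confirmation produced for the previous incarnation
        // arrives and is taken as this agent's own registration: the id in
        // SlaveInfo is overwritten (and checkpointed)...
        info.id = "b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371";

        // ...but the checkpointed tasks still reference the old id, so the
        // data the agent later sends on re-registration is internally
        // inconsistent: the master adds agent ...-S3371, then tries to look up
        // ...-S12 for task1 and hits the CHECK in removeTask.
        assert(info.id == tasks[0].slave_id);  // fails: ...-S3371 != ...-S12
      }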
      

      4. Everything kept running after that, with the master periodically complaining:

      W0415 01:09:27.490134 21127 master.cpp:5700] Ignoring status update TASK_FAILED (UUID: b1ebcfc6-8f25-4ea8-a5eb-b5968e933a94) for task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a of framework 201205082337-0000000003-0000 from removed agent slave(1)@agent_ip:5051 with id b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12
      W0415 01:09:37.512106 21130 master.cpp:5700] Ignoring status update TASK_FAILED (UUID: b1ebcfc6-8f25-4ea8-a5eb-b5968e933a94) for task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a of framework 201205082337-0000000003-0000 from removed agent slave(1)@agent_ip:5051 with id b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12
      W0415 01:09:57.465610 21137 master.cpp:5700] Ignoring status update TASK_FAILED (UUID: b1ebcfc6-8f25-4ea8-a5eb-b5968e933a94) for task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a of framework 201205082337-0000000003-0000 from removed agent slave(1)@agent_ip:5051 with id b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12
      W0415 01:10:37.477113 21123 master.cpp:5700] Ignoring status update TASK_FAILED (UUID: b1ebcfc6-8f25-4ea8-a5eb-b5968e933a94) for task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a of framework 201205082337-0000000003-0000 from removed agent slave(1)@agent_ip:5051 with id b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12
      W0415 01:11:57.660826 21116 master.cpp:5700] Ignoring status update TASK_FAILED (UUID: b1ebcfc6-8f25-4ea8-a5eb-b5968e933a94) for task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a of framework 201205082337-0000000003-0000 from removed agent slave(1)@agent_ip:5051 with id b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12
      ...
      

      Eventually the agent had to re-register with the master and sent the inconsistent data, causing the master crash described above. The agent also crashed (an illustration of the failing check follows the backtrace):

      I0415 02:50:50.962502  4005 slave.cpp:4169] No pings from master received within 75secs
      I0415 02:50:50.962779  3987 slave.cpp:904] Re-detecting master
      I0415 02:50:50.962803  3987 slave.cpp:947] Detecting new master
      I0415 02:50:50.962816  3992 status_update_manager.cpp:177] Pausing sending status updates
      I0415 02:50:50.962927  4005 slave.cpp:915] New master detected at master@master_ip:5050
      I0415 02:50:50.962949  4005 slave.cpp:936] No credentials provided. Attempting to register without authentication
      I0415 02:50:50.962960  4005 slave.cpp:947] Detecting new master
      I0415 02:50:50.962960  3999 status_update_manager.cpp:177] Pausing sending status updates
      I0415 02:51:15.603344  3987 slave.cpp:5044] Current disk usage 41.01%. Max allowed age: 3.429157018507037days
      I0415 02:51:35.559847  3986 slave.cpp:1217] Re-registered with master master@master_ip:5050
      I0415 02:51:35.559919  3986 slave.cpp:1253] Forwarding total oversubscribed resources {}
      I0415 02:51:35.559972  3991 status_update_manager.cpp:184] Resuming sending status updates
      W0415 02:51:35.559994  3991 status_update_manager.cpp:191] Resending status update TASK_FAILED (UUID: b1ebcfc6-8f25-4ea8-a5eb-b5968e933a94) for task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a of framework 201205082337-0000000003-0000
      I0415 02:51:35.560135  4002 slave.cpp:4051] Forwarding the update TASK_FAILED (UUID: b1ebcfc6-8f25-4ea8-a5eb-b5968e933a94) for task task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a of framework 201205082337-0000000003-0000 to master@master_ip:5050
      I0415 02:51:35.727874  3999 slave.cpp:2536] Shutting down framework 201205082337-0000000003-0000
      I0415 02:51:35.727900  3999 slave.cpp:4646] Cleaning up executor 'thermos-task1-d74842f3-52e7-49cf-9f96-8176b5e9aa1a' of framework 201205082337-0000000003-0000 at executor(1)@agent_ip:33430
      F0415 02:51:35.728001  3999 slave.cpp:4671] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory 
      *** Check failure stack trace: ***
          @     0x7f8b51dfe06d  (unknown)
          @     0x7f8b51dffe00  (unknown)
          @     0x7f8b51dfdc32  (unknown)
          @     0x7f8b51e00799  (unknown)
          @     0x7f8b51498fee  (unknown)
          @     0x7f8b514a4f8d  (unknown)
          @     0x7f8b514f65ba  (unknown)
          @     0x7f8b514b9564  (unknown)
          @     0x7f8b514db8c9  (unknown)
          @     0x7f8b51d95b94  (unknown)
          @     0x7f8b51d95ed7  (unknown)
          @     0x7f8b50047230  (unknown)
          @     0x7f8b502a2dc5  start_thread
          @     0x7f8b4fab273d  __clone
          @              (nil)  (unknown)
      /usr/local/bin/mesos-slave.sh: line 99:  3983 Aborted                 (core dumped) $debug /usr/local/sbin/mesos-slave "${MESOS_FLAGS[@]}"
      Slave Exit Status: 134
      
      #0  0x00007f8b4f9f01d7 in raise () from /lib64/libc.so.6
      #1  0x00007f8b4f9f18c8 in abort () from /lib64/libc.so.6
      #2  0x00007f8b51e0688c in google::DumpStackTraceAndExit() () from /usr/local/lib64/libmesos-1.1.0.so
      #3  0x00007f8b51dfe06d in google::LogMessage::Fail() () from /usr/local/lib64/libmesos-1.1.0.so
      #4  0x00007f8b51dffe00 in google::LogMessage::SendToLog() () from /usr/local/lib64/libmesos-1.1.0.so
      #5  0x00007f8b51dfdc32 in google::LogMessage::Flush() () from /usr/local/lib64/libmesos-1.1.0.so
      #6  0x00007f8b51e00799 in google::LogMessageFatal::~LogMessageFatal() () from /usr/local/lib64/libmesos-1.1.0.so
      #7  0x00007f8b51498fee in mesos::internal::slave::Slave::removeExecutor(mesos::internal::slave::Framework*, mesos::internal::slave::Executor*) ()
         from /usr/local/lib64/libmesos-1.1.0.so
      #8  0x00007f8b514a4f8d in mesos::internal::slave::Slave::shutdownFramework(process::UPID const&, mesos::FrameworkID const&) () from /usr/local/lib64/libmesos-1.1.0.so
      #9  0x00007f8b514f65ba in void ProtobufProcess<mesos::internal::slave::Slave>::handler1<mesos::internal::ShutdownFrameworkMessage, mesos::FrameworkID const&, mesos::FrameworkID const&>(mesos::internal::slave::Slave*, void (mesos::internal::slave::Slave::*)(process::UPID const&, mesos::FrameworkID const&), mesos::FrameworkID const& (mesos::internal::ShutdownFrameworkMessage::*)() const, process::UPID const&, std::string const&) () from /usr/local/lib64/libmesos-1.1.0.so
      #10 0x00007f8b514b9564 in std::_Function_handler<void (process::UPID const&, std::string const&), std::_Bind<void (*(mesos::internal::slave::Slave*, void (mesos::internal::slave::Slave::*)(process::UPID const&, mesos::FrameworkID const&), mesos::FrameworkID const& (mesos::internal::ShutdownFrameworkMessage::*)() const, std::_Placeholder<1>, std::_Placeholder<2>))(mesos::internal::slave::Slave*, void (mesos::internal::slave::Slave::*)(process::UPID const&, mesos::FrameworkID const&), mesos::FrameworkID const& (mesos::internal::ShutdownFrameworkMessage::*)() const, process::UPID const&, std::string const&)> >::_M_invoke(std::_Any_data const&, process::UPID const&, std::string const&) ()
         from /usr/local/lib64/libmesos-1.1.0.so
      #11 0x00007f8b514db8c9 in ProtobufProcess<mesos::internal::slave::Slave>::visit(process::MessageEvent const&) () from /usr/local/lib64/libmesos-1.1.0.so
      #12 0x00007f8b51d95b94 in process::ProcessManager::resume(process::ProcessBase*) () from /usr/local/lib64/libmesos-1.1.0.so
      #13 0x00007f8b51d95ed7 in std::thread::_Impl<std::_Bind_simple<process::ProcessManager::init_threads()::{unnamed type#1} ()> >::_M_run() ()
         from /usr/local/lib64/libmesos-1.1.0.so
      #14 0x00007f8b50047230 in ?? () from /lib64/libstdc++.so.6
      #15 0x00007f8b502a2dc5 in start_thread () from /lib64/libpthread.so.0
      #16 0x00007f8b4fab273d in clone () from /lib64/libc.so.6
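
      The agent-side abort comes from a CHECK_SOME around os::touch(); per the log message it fires because the file cannot be created ("No such file or directory"), i.e. a parent directory of the path is missing. Given the mixed ids above, a checkpoint path built from the new id ...-S3371 while the executor's directories live under ...-S12 is a plausible, but unconfirmed, suspect. A minimal standalone illustration with a plain C++ stand-in for os::touch():

      #include <cstdio>
      #include <cstdlib>
      #include <string>

      // Stand-in for stout's os::touch(): returns false if the file cannot be
      // created, e.g. because a parent directory does not exist.
      static bool touch(const std::string& path)
      {
        std::FILE* f = std::fopen(path.c_str(), "a");
        if (f == nullptr) {
          return false;
        }
        std::fclose(f);
        return true;
      }

      int main()
      {
        // Hypothetical checkpoint path whose parent directory does not exist.
        const std::string path =
            "/nonexistent/meta/slaves/S3371/frameworks/fw/executors/e/sentinel";

        if (!touch(path)) {
          // In the real code this is CHECK_SOME(os::touch(path)), which aborts:
          // "Failed to open file: No such file or directory".
          std::fprintf(stderr, "CHECK failed: could not touch %s\n", path.c_str());
          std::abort();
        }
        return 0;
      }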
      

      The agent also failed to recover after that, saving the master from crashing again:

      I0415 02:51:36.273108 60951 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
      I0415 02:51:36.273157 60951 state.cpp:698] No committed checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
      W0415 02:51:36.273499 60951 state.cpp:212] Failed to find framework info file '/var/lib/mesos/meta/slaves/b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371/frameworks/201205082337-0000000003-0000/framework.info'
      I0415 02:51:36.273879 60955 slave.cpp:5369] Recovering framework 201205082337-0000000003-0000
      I0415 02:51:36.273955 60945 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371/frameworks/201205082337-0000000003-0000' for gc 6.94708016299852days in the future
      I0415 02:51:36.274008 60945 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/b8f47842-5f08-446a-9b1f-532fe3bd47ed-S3371/frameworks/201205082337-0000000003-0000' for gc 6.94708016259259days in the future
      I0415 02:51:36.274042 60948 status_update_manager.cpp:203] Recovering status update manager
      I0415 02:51:36.274241 60943 containerizer.cpp:555] Recovering containerizer
      I0415 02:51:36.275825 60948 port_mapping.cpp:2302] Network isolator recovery complete
      I0415 02:51:36.277570 60946 provisioner.cpp:253] Provisioner recovery complete
      I0415 02:51:36.277676 60940 slave.cpp:5281] Finished recovery
      I0415 02:51:36.277961 60940 slave.cpp:5314] Garbage collecting old agent b8f47842-5f08-446a-9b1f-532fe3bd47ed-S12
      

      Therefore the master crash seems to involve several issues:
      1. bootId is checkpointed too early in the agent startup process. Restarting the agent between recovery and registration can cause the agent to fail to detect a reboot.
      2. After a restart, the agent can receive messages destined for its predecessor. In this case it confused the agent, tricking it into assuming it was registered as a new agent.

            People

              Assignee: Unassigned
              Reporter: Dmitry Zhuk (dzhuk)
              Votes: 0
              Watchers: 4