Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.22.0
- None
- None
- Mesosphere Q1 Sprint 7 - 4/17, Mesosphere Q2 Sprint 8 - 5/1
Description
The slave sometimes does not send its active executors during re-registration. Framework checkpointing is enabled, and the executor successfully re-registers. However, the Mesos master removes the executor as unknown to the slave, so the tasks in that executor are marked LOST (reason: abnormal executor termination). The logs below follow one affected task, task.journalnode.journalnode.NodeExecutor.1428609184051.
Slave logs for the task:
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
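To follow a single task through log excerpts like these, the entries can be filtered by task ID and reduced to a timeline. The sketch below is only an illustration: it assumes one log entry per line and relies on the standard glog header layout (severity letter and MMDD date, time, thread ID, `file:line]`). The helper name `task_timeline` is hypothetical and not part of Mesos.

```python
import re

# glog header: severity letter + MMDD, time with microseconds, thread id,
# source file:line, then the message. The syslog prefix before it is skipped.
GLOG = re.compile(
    r"(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+:\d+)\] (?P<message>.*)"
)
# Task states appear in the messages as TASK_RUNNING, TASK_LOST, and so on.
STATE = re.compile(r"\b(TASK_[A-Z]+)\b")


def task_timeline(lines, task_id):
    """Return (time, source, state) for every entry that mentions task_id.

    `state` is None when the entry names the task but no TASK_* state.
    """
    timeline = []
    for line in lines:
        match = GLOG.search(line)
        if match is None or task_id not in match["message"]:
            continue
        state = STATE.search(match["message"])
        timeline.append(
            (match["time"], match["source"], state.group(1) if state else None)
        )
    return timeline
```

Applied to the slave and master excerpts for task.journalnode.journalnode.NodeExecutor.1428609184051, this gives the TASK_RUNNING entries at 19:53:06 followed by the TASK_LOST entries at 20:19:43, which makes the gap between the last healthy update and the loss easy to see.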
Master logs:
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 20:19:43.008666 1067 master.cpp:4015] Executor executor.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008652 1074 hierarchical.hpp:648] Recovered cpus(*):0.1; mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210; ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041, 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 20150407-233647-2059219722-5050-1659-S5 from framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008712 1067 master.cpp:4714] Removing executor 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1; mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.010372 1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013746 1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013767 1067 master.cpp:3336] Forwarding status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
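The master entries above show the sequence: the executor is flagged as possibly unknown to the slave, it is removed, and its task is marked TASK_LOST. A simplified model of that reconciliation is sketched here. This is not the Mesos master code; the `Executor` class and `reconcile_executors` function are hypothetical names used only to illustrate why an executor missing from the slave's re-registration message costs its tasks.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Executor:
    """Hypothetical stand-in for the master's record of an executor."""
    executor_id: str
    task_ids: List[str] = field(default_factory=list)


def reconcile_executors(
    master_view: List[Executor], reported_ids: Set[str]
) -> Tuple[List[Executor], List[Tuple[str, str]]]:
    """Model the master's handling of a slave re-registration.

    master_view: executors the master believes are running on the slave.
    reported_ids: executor IDs the slave included when re-registering.
    Returns the executors that are kept and (task_id, state) pairs for
    tasks that are lost.
    """
    kept, lost_tasks = [], []
    for executor in master_view:
        if executor.executor_id in reported_ids:
            kept.append(executor)
        else:
            # The executor is treated as unknown to the slave: it is
            # removed, and every task it was running is marked lost.
            lost_tasks.extend(
                (task_id, "TASK_LOST") for task_id in executor.task_ids
            )
    return kept, lost_tasks
```

In this model, a slave that re-registers with an empty executor list while the master still tracks executor.journalnode.NodeExecutor.1428609184051 yields no kept executors and one TASK_LOST entry, even though the executor itself is still alive on the slave.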
Issue Links
- relates to
  - MESOS-1800 The slave does not send pending executors during re-registration. (Open)
  - MESOS-1715 The slave does not send pending tasks during re-registration. (Resolved)
  - MESOS-1720 Slave should send exited executor message when the executor is never launched. (Resolved)
  - MESOS-2601 Tasks are not removed after recovery from slave and mesos containerizer (Resolved)