[MESOS-2605] The slave sometimes does not send active executors during reregistration - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.22.0
Fix Version/s: None
Component/s: None
Labels:
- mesosphere

Sprint:
Mesosphere Q1 Sprint 7 - 4/17, Mesosphere Q2 Sprint 8 - 5/1

Description

The slave sometimes does not send its active executors during reregistration. Framework checkpointing is enabled, and the executor successfully reregisters. However, the master treats the executor as unknown to the slave and removes it, so the tasks in that executor become TASK_LOST (abnormal executor termination). The logs below show this for task task.journalnode.journalnode.NodeExecutor.1428609184051.

Slave Logs for the task:

Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
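To follow a single task across the slave and master logs, the status-update lines can be reduced to a few fields. The snippet below is a minimal sketch that assumes lines shaped like the excerpts in this report; the regular expression and the helper name parse_status_update are illustrative and not part of Mesos.

```python
import re
from typing import Optional, Tuple

# Matches the glog-style part of a line as it appears in the excerpts:
# "<HH:MM:SS.micros> <thread> <file>:<line>] ... TASK_X (UUID: ...) for task <id> of framework <id>"
STATUS_UPDATE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+\s+(?P<source>[\w.]+:\d+)\]"
    r".*?(?P<state>TASK_[A-Z]+) \(UUID: (?P<uuid>[0-9a-f-]+)\)"
    r" for task (?P<task>\S+) of framework (?P<framework>\S+)"
)


def parse_status_update(line: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (time, source, state, task) for a status-update line, else None."""
    match = STATUS_UPDATE.search(line)
    if match is None:
        return None
    return (match["time"], match["source"], match["state"], match["task"])
```

Applying this to both logs and filtering on the task ID gives a short timeline: TASK_RUNNING on the slave at 19:53:06, then TASK_LOST on the master at 20:19:43.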

Master Logs:

Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 20:19:43.008666  1067 master.cpp:4015] Executor executor.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008652  1074 hierarchical.hpp:648] Recovered cpus(*):0.1; mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210; ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041, 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 20150407-233647-2059219722-5050-1659-S5 from framework 20150408-002100-4261056010-5050-1047-0008
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008712  1067 master.cpp:4714] Removing executor 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1; mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)

Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.010372  1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)


Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013746  1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013767  1067 master.cpp:3336] Forwarding status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
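The effect visible in the master logs above (an executor flagged as "possibly unknown to the slave", then removed, then TASK_LOST for its task) can be modeled as a set difference between what the master tracks and what the slave reports when it reregisters. This is only a sketch of that reading of the logs, not the code in master.cpp; the function reconcile_executors and its data shapes are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def reconcile_executors(
    known_to_master: Dict[str, List[str]],
    reported_by_slave: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (executors the master would remove, tasks that end up lost).

    known_to_master maps executor ID -> task IDs the master tracks for one slave.
    reported_by_slave is the set of executor IDs in the reregistration message.
    Both shapes are assumptions made for illustration.
    """
    # Any executor the master knows about but the slave did not report
    # is treated as unknown to the slave.
    removed = sorted(e for e in known_to_master if e not in reported_by_slave)
    # Tasks that belonged to a removed executor are the ones reported lost.
    lost = sorted(task for e in removed for task in known_to_master[e])
    return removed, lost


known = {
    "executor.journalnode.NodeExecutor.1428609184051": [
        "task.journalnode.journalnode.NodeExecutor.1428609184051"
    ]
}
# The slave omits its active executor, as described in this report.
removed, lost = reconcile_executors(known, set())
```

With an empty report, the still-running executor ends up in removed and its task in lost, which is the failure described here. If the slave had included the executor, both lists would be empty.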

Issue Links

relates to
- MESOS-1800 The slave does not send pending executors during re-registration. (Open)
- MESOS-1715 The slave does not send pending tasks during re-registration. (Resolved)
- MESOS-1720 Slave should send exited executor message when the executor is never launched. (Resolved)
- MESOS-2601 Tasks are not removed after recovery from slave and mesos containerizer (Resolved)
