Mesos / MESOS-2605

The slave sometimes does not send active executors during reregistration


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • 0.22.0
    • None
    • None
    • Sprint: Mesosphere Q1 Sprint 7 - 4/17, Mesosphere Q2 Sprint 8 - 5/1

    Description

      During reregistration, the slave sometimes omits active executors from the message it sends to the master. Framework checkpointing is enabled, and the executor successfully reregisters. However, the Mesos master then removes the executor as unknown to the slave, so the tasks in that executor are marked TASK_LOST (attributed to abnormal executor termination). See the example below for task task.journalnode.journalnode.NodeExecutor.1428609184051.
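      The failure mode described above can be modeled as a set difference. The following is an illustrative sketch only, not Mesos source code: the function name `reconcile_executors` and the data shapes are hypothetical, and it assumes the master compares the executors it has on record for a slave against those the slave reports on reregistration.

```python
def reconcile_executors(master_known, slave_reported):
    """Return executor IDs the master knows but the slave did not report.

    Hypothetical model: both arguments are iterables of executor ID strings.
    """
    return sorted(set(master_known) - set(slave_reported))


# Executor ID taken from the master log in this report.
known = ["executor.journalnode.NodeExecutor.1428609184051"]

# The bug as modeled here: the slave's reregistration message omits an
# executor that is still running, so the master treats it as unknown.
reported = []

unknown = reconcile_executors(known, reported)
for executor_id in unknown:
    print(f"Executor {executor_id} possibly unknown to the slave")
```

      In this model, any executor that is running but missing from the reported list is removed by the master, and its tasks are then reported as lost.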

      Slave logs for the task:

      Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
      Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
      Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050
      Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638
      Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
      Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
      

      Master Logs:

      Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 20:19:43.008666  1067 master.cpp:4015] Executor executor.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
      Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008652  1074 hierarchical.hpp:648] Recovered cpus(*):0.1; mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210; ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041, 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 20150407-233647-2059219722-5050-1659-S5 from framework 20150408-002100-4261056010-5050-1047-0008
      Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.008712  1067 master.cpp:4714] Removing executor 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1; mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
      
      Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.010372  1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
      
      
      Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013746  1067 master.cpp:3295] Status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008 from slave 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 (ec2-54-237-57-237.compute-1.amazonaws.com)
      Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 20:19:43.013767  1067 master.cpp:3336] Forwarding status update TASK_LOST (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 20150408-002100-4261056010-5050-1047-0008
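      To line up the slave and master views of a task, the status updates can be pulled out of log lines like the ones above. This is a minimal sketch: the regular expressions and the `parse_update` helper are assumptions derived from the lines shown in this report, not an official description of the log format.

```python
import re

# Assumed layout of the logging prefix seen above:
# <level><MMDD> <HH:MM:SS.micros> <thread> <file>:<line>] <message>
GLOG = re.compile(
    r"(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+):(?P<line>\d+)\] (?P<message>.*)"
)
# Matches "status update TASK_X (UUID: ...) for task T of framework F".
UPDATE = re.compile(
    r"[Ss]tatus update (?P<state>TASK_[A-Z]+) \(UUID: (?P<uuid>[0-9a-f-]+)\) "
    r"for task (?P<task>\S+) of framework (?P<framework>\S+)"
)


def parse_update(line):
    """Return a dict describing a status update, or None if the line has none."""
    glog = GLOG.search(line)
    if glog is None:
        return None
    update = UPDATE.search(glog.group("message"))
    if update is None:
        return None
    return {"time": glog.group("time"), "source": glog.group("source"),
            **update.groupdict()}


TASK = "task.journalnode.journalnode.NodeExecutor.1428609184051"
FRAMEWORK = "20150408-002100-4261056010-5050-1047-0008"

slave_line = (
    "Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: "
    "I0409 19:53:06.778790 25126 status_update_manager.cpp:317] "
    "Received status update TASK_RUNNING "
    "(UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) "
    f"for task {TASK} of framework {FRAMEWORK}"
)
master_line = (
    "Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: "
    "I0409 20:19:43.013767  1067 master.cpp:3336] "
    "Forwarding status update TASK_LOST "
    "(UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) "
    f"for task {TASK} of framework {FRAMEWORK}"
)

for line in (slave_line, master_line):
    update = parse_update(line)
    print(update["time"], update["source"], update["state"], update["task"])
```

      Grouping the parsed records by task makes the transition from TASK_RUNNING on the slave to TASK_LOST on the master easy to spot.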
      

            People

              Unassigned
              Elizabeth Lingg (elingg)
              Adam B