Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-533

SlaveRecoveryTest/0.CleanupExecutor is flaky on Jenkins.

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:

      Description

      Looks like there is a bad Result.get() call lurking in the code!

      [ RUN ] SlaveRecoveryTest/0.CleanupExecutor
      I0627 23:10:37.469480 11637 master.cpp:228] Master started on 67.195.138.61:42260
      I0627 23:10:37.469604 11637 master.cpp:243] Master ID: 201306272310-1032504131-42260-11206
      I0627 23:10:37.470226 11634 hierarchical_allocator_process.hpp:295] Initializing hierarchical allocator process with master : master@67.195.138.61:42260
      W0627 23:10:37.470305 11632 master.cpp:83] No whitelist given. Advertising offers for all slaves
      I0627 23:10:37.470399 11637 master.cpp:526] Elected as master!
      I0627 23:10:37.486925 11633 slave.cpp:112] Slave started on 22)@67.195.138.61:42260
      I0627 23:10:37.487216 11633 slave.cpp:204] Slave resources: cpus=2; mem=1024; ports=[31000-32000]; disk=1024
      I0627 23:10:37.487869 11639 sched.cpp:177] New master at master@67.195.138.61:42260
      I0627 23:10:37.495751 11633 slave.cpp:528] New master detected at master@67.195.138.61:42260
      I0627 23:10:37.495753 11632 process_isolator.cpp:303] Recovering isolator
      I0627 23:10:37.495971 11639 master.cpp:569] Registering framework 201306272310-1032504131-42260-11206-0000 at scheduler(18)@67.195.138.61:42260
      I0627 23:10:37.497521 11632 sched.cpp:222] Framework registered with 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.496054 11634 status_update_manager.cpp:155] New master detected at master@67.195.138.61:42260
      I0627 23:10:37.497534 11639 hierarchical_allocator_process.hpp:327] Added framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.498798 11639 hierarchical_allocator_process.hpp:705] No resources available to allocate!
      I0627 23:10:37.499268 11639 hierarchical_allocator_process.hpp:667] Performed allocation for 0 slaves in 472.547us
      I0627 23:10:37.496024 11633 slave.cpp:543] Postponing registration until recovery is complete
      I0627 23:10:37.500259 11633 slave.cpp:389] Finished recovery
      I0627 23:10:37.501011 11635 master.cpp:891] Attempting to register slave on juno.apache.org at slave(22)@67.195.138.61:42260
      I0627 23:10:37.501247 11635 master.cpp:1851] Adding slave 201306272310-1032504131-42260-11206-0 at juno.apache.org with cpus=2; mem=1024; ports=[31000-32000]; disk=1024
      I0627 23:10:37.501842 11638 slave.cpp:588] Registered with master master@67.195.138.61:42260; given slave ID 201306272310-1032504131-42260-11206-0
      I0627 23:10:37.501883 11635 hierarchical_allocator_process.hpp:449] Added slave 201306272310-1032504131-42260-11206-0 (juno.apache.org) with cpus=2; mem=1024; ports=[31000-32000]; disk=1024 (and cpus=2; mem=1024; ports=[31000-32000]; disk=1024 available)
      I0627 23:10:37.502459 11638 paths.hpp:336] Created slave directory '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0'
      Checkpointing SlaveInfo to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/slave.info'
      I0627 23:10:37.502784 11635 hierarchical_allocator_process.hpp:727] Offering cpus=2; mem=1024; ports=[31000-32000]; disk=1024 on slave 201306272310-1032504131-42260-11206-0 to framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.503823 11635 hierarchical_allocator_process.hpp:687] Performed allocation for slave 201306272310-1032504131-42260-11206-0 in 1.055ms
      I0627 23:10:37.503890 11638 master.hpp:313] Adding offer 201306272310-1032504131-42260-11206-0 with resources cpus=2; mem=1024; ports=[31000-32000]; disk=1024 on slave 201306272310-1032504131-42260-11206-0
      I0627 23:10:37.505142 11638 master.cpp:1239] Sending 1 offers to framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.507694 11633 master.cpp:1472] Processing reply for offer 201306272310-1032504131-42260-11206-0 on slave 201306272310-1032504131-42260-11206-0 (juno.apache.org) for framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.507940 11633 master.hpp:291] Adding task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 with resources cpus=2; mem=1024; ports=[31000-32000]; disk=1024 on slave 201306272310-1032504131-42260-11206-0
      I0627 23:10:37.508301 11633 master.cpp:1591] Launching task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000 with resources cpus=2; mem=1024; ports=[31000-32000]; disk=1024 on slave 201306272310-1032504131-42260-11206-0 (juno.apache.org)
      I0627 23:10:37.508805 11636 slave.cpp:738] Got assigned task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 for framework 201306272310-1032504131-42260-11206-0000
      Checkpointing FrameworkInfo to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/framework.info'
      Checkpointing 'scheduler(18)@67.195.138.61:42260' to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/framework.pid'
      I0627 23:10:37.510175 11636 slave.cpp:836] Launching task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 for framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.511561 11636 paths.hpp:303] Created executor directory '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554'
      Checkpointing ExecutorInfo to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/executor.info'
      I0627 23:10:37.511950 11636 paths.hpp:303] Created executor directory '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554'
      Checkpointing Task to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554/tasks/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/task.info'
      I0627 23:10:37.513481 11635 process_isolator.cpp:99] Launching d5edf5d2-94e1-47d6-ae45-03e8844d5f36 (/home/jenkins/jenkins-slave/workspace/Mesos-Trunk-Ubuntu-Build-In-Src-Set-JAVA_HOME/src/mesos-executor) in /tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554 with resources ' for framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.508839 11633 master.hpp:323] Removing offer 201306272310-1032504131-42260-11206-0 with resources cpus=2; mem=1024; ports=[31000-32000]; disk=1024 on slave 201306272310-1032504131-42260-11206-0
      I0627 23:10:37.518288 11636 slave.cpp:947] Queuing task 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' for executor d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework '201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.518731 11636 slave.cpp:510] Successfully attached file '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554'
      I0627 23:10:37.519456 11635 process_isolator.cpp:161] Forked executor at 12414
      Checkpointing forked pid 12414
      Checkpointing '12414' to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554/pids/forked.pid'
      Fetching resources into '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554'
      I0627 23:10:37.561638 11640 process.cpp:878] Socket closed while receiving
      WARNING: Logging before InitGoogleLogging() is written to STDERR
      I0627 23:10:37.600602 12434 process.cpp:1426] libprocess is initialized on 67.195.138.61:43332 for 8 cpus
      I0627 23:10:37.610895 12450 exec.cpp:170] Executor started at: executor(1)@67.195.138.61:43332 with pid 12434
      I0627 23:10:37.611531 11637 slave.cpp:1394] Got registration for executor 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' of framework 201306272310-1032504131-42260-11206-0000
      Checkpointing 'executor(1)@67.195.138.61:43332' to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554/pids/libprocess.pid'
      I0627 23:10:37.612431 12450 exec.cpp:194] Executor registered on slave 201306272310-1032504131-42260-11206-0
      Registered executor on juno.apache.org
      I0627 23:10:37.612457 12456 process.cpp:878] Socket closed while receiving
      I0627 23:10:37.612041 11637 slave.cpp:1509] Flushing queued task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 for executor 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.613847 12456 process.cpp:878] Socket closed while receiving
      I0627 23:10:37.613988 12452 exec.cpp:258] Executor asked to run task 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36'
      Starting task d5edf5d2-94e1-47d6-ae45-03e8844d5f36
      Forked command at 12457
      sh -c 'sleep 1000'
      I0627 23:10:37.615681 12452 exec.cpp:404] Executor sending status update TASK_RUNNING (UUID: 1c5fd31b-7a14-4573-a116-ff8baac5c907) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.615947 11633 slave.cpp:1691] Handling status update TASK_RUNNING (UUID: 1c5fd31b-7a14-4573-a116-ff8baac5c907) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000 from executor(1)@67.195.138.61:43332
      I0627 23:10:37.616220 11633 status_update_manager.cpp:290] Received status update TASK_RUNNING (UUID: 1c5fd31b-7a14-4573-a116-ff8baac5c907) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000 with checkpoint=true
      I0627 23:10:37.616529 11633 status_update_manager.cpp:450] Creating StatusUpdate stream for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.617097 11633 status_update_manager.hpp:324] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 1c5fd31b-7a14-4573-a116-ff8baac5c907) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.706604 11633 status_update_manager.cpp:336] Forwarding status update TASK_RUNNING (UUID: 1c5fd31b-7a14-4573-a116-ff8baac5c907) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000 to master@67.195.138.61:42260
      I0627 23:10:37.707031 11636 slave.cpp:1802] Status update manager successfully handled status update TASK_RUNNING (UUID: 1c5fd31b-7a14-4573-a116-ff8baac5c907) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.707044 11633 master.cpp:1022] Status update from slave(22)@67.195.138.61:42260: task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000 is now in state TASK_RUNNING
      I0627 23:10:37.707169 11636 slave.cpp:1808] Sending acknowledgement for status update TASK_RUNNING (UUID: 1c5fd31b-7a14-4573-a116-ff8baac5c907) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000 to executor(1)@67.195.138.61:43332
      I0627 23:10:37.708674 12456 process.cpp:878] Socket closed while receiving
      I0627 23:10:37.708730 12448 exec.cpp:290] Executor received status update acknowledgement 1c5fd31b-7a14-4573-a116-ff8baac5c907 for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.708386 11636 status_update_manager.cpp:360] Received status update acknowledgement 1c5fd31b-7a14-4573-a116-ff8baac5c907 for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.710053 11636 status_update_manager.hpp:324] Checkpointing ACK for status update TASK_RUNNING (UUID: 1c5fd31b-7a14-4573-a116-ff8baac5c907) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.904566 11636 slave.cpp:1342] Status update manager successfully handled status update acknowledgement 1c5fd31b-7a14-4573-a116-ff8baac5c907 for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.905002 11206 slave.cpp:439] Slave terminating
      I0627 23:10:37.905333 11632 master.cpp:460] Slave 201306272310-1032504131-42260-11206-0 (juno.apache.org) disconnected
      I0627 23:10:37.906338 11636 slave.cpp:112] Slave started on 23)@67.195.138.61:42260
      I0627 23:10:37.906491 11637 hierarchical_allocator_process.hpp:667] Performed allocation for 1 slaves in 41.695us
      I0627 23:10:37.906744 11636 slave.cpp:204] Slave resources: cpus=2; mem=1024; ports=[31000-32000]; disk=1024
      I0627 23:10:37.907874 11636 state.cpp:33] Recovering state from /tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta
      I0627 23:10:37.909067 11632 status_update_manager.cpp:177] Recovering status update manager
      I0627 23:10:37.909111 11632 status_update_manager.cpp:181] Recovering executor 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.909427 11636 slave.cpp:528] New master detected at master@67.195.138.61:42260
      I0627 23:10:37.909481 11632 status_update_manager.cpp:450] Creating StatusUpdate stream for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.910508 11632 status_update_manager.hpp:288] Replaying status update stream for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36
      I0627 23:10:37.911044 11635 process_isolator.cpp:303] Recovering isolator
      I0627 23:10:37.911408 11635 process_isolator.cpp:311] Recovering executor 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.909950 11636 slave.cpp:537] Skipping registration because slave is in 'cleanup' mode
      I0627 23:10:37.912358 11636 slave.cpp:2819] Recovering executor 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.912951 11636 slave.cpp:2662] Sending shutdown to executor 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' of framework 201306272310-1032504131-42260-11206-0000 to executor(1)@67.195.138.61:43332
      I0627 23:10:37.913314 11636 slave.cpp:2327] Shutting down executor 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.914327 12456 process.cpp:878] Socket closed while receiving
      I0627 23:10:37.914351 12455 exec.cpp:323] Executor asked to shutdown
      Shutting down
      I0627 23:10:37.914877 12454 exec.cpp:75] Scheduling shutdown of the executor
      I0627 23:10:37.911058 11632 status_update_manager.cpp:155] New master detected at master@67.195.138.61:42260
      I0627 23:10:37.915552 11636 slave.cpp:510] Successfully attached file '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/executors/d5edf5d2-94e1-47d6-ae45-03e8844d5f36/runs/a671f481-eaea-4ffb-bde3-93efa1c7c554'
      I0627 23:10:37.915717 11636 slave.cpp:389] Finished recovery
      W0627 23:10:37.916795 11636 master.cpp:944] Slave at slave(23)@67.195.138.61:42260 (juno.apache.org) is being allowed to re-register with an already in use id (201306272310-1032504131-42260-11206-0)
      W0627 23:10:37.916992 11636 master.cpp:944] Slave at slave(23)@67.195.138.61:42260 (juno.apache.org) is being allowed to re-register with an already in use id (201306272310-1032504131-42260-11206-0)
      I0627 23:10:37.917031 11634 slave.cpp:629] Re-registered with master master@67.195.138.61:42260
      I0627 23:10:37.918179 11634 slave.cpp:1276] Updating framework 201306272310-1032504131-42260-11206-0000 pid to scheduler(18)@67.195.138.61:42260
      Checkpointing 'scheduler(18)@67.195.138.61:42260' to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/framework.pid'
      W0627 23:10:37.918856 11634 slave.cpp:643] Already re-registered with master master@67.195.138.61:42260
      I0627 23:10:37.919612 11634 slave.cpp:1276] Updating framework 201306272310-1032504131-42260-11206-0000 pid to scheduler(18)@67.195.138.61:42260
      Checkpointing 'scheduler(18)@67.195.138.61:42260' to '/tmp/SlaveRecoveryTest_0_CleanupExecutor_BF7VIa/meta/slaves/201306272310-1032504131-42260-11206-0/frameworks/201306272310-1032504131-42260-11206-0000/framework.pid'
      I0627 23:10:37.926779 11635 hierarchical_allocator_process.hpp:667] Performed allocation for 1 slaves in 53.179us
      I0627 23:10:37.936836 11639 hierarchical_allocator_process.hpp:667] Performed allocation for 1 slaves in 39.736us
      I0627 23:10:37.947145 11638 hierarchical_allocator_process.hpp:667] Performed allocation for 1 slaves in 34.052us
      W0627 23:10:37.956991 11637 master.cpp:83] No whitelist given. Advertising offers for all slaves
      I0627 23:10:37.957289 11633 slave.cpp:2396] Killing executor 'd5edf5d2-94e1-47d6-ae45-03e8844d5f36' of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.957679 11632 hierarchical_allocator_process.hpp:667] Performed allocation for 1 slaves in 50.929us
      Waited on process 12457, returned status 9
      Performing killtree operation on 12457
      Stopped 12457
      Children of 12457:

      { 12459 }

      Members of group 12457:

      { 12457, 12459 }
      Members of session 12457: { 12457, 12459 }

      Stopped 12459
      Children of 12459: { }
      Members of group 12457:

      { 12457, 12459 }
      Members of session 12457: { 12457, 12459 }

      Signaled 12457
      Signaled 12459
      Result::get() but state == NONE
      I0627 23:10:37.968878 12455 exec.cpp:404] Executor sending status update TASK_FAILED (UUID: 45764b93-5acd-4ba8-923e-e39fd25a8f8c) for task d5edf5d2-94e1-47d6-ae45-03e8844d5f36 of framework 201306272310-1032504131-42260-11206-0000
      I0627 23:10:37.976610 12456 process.cpp:876] Socket error while receiving: Connection reset by peer
      make[3]: *** [check-local] Aborted
      I0627 23:10:37.976788 12454 exec.cpp:348] Ignoring exited event because the driver is aborted!

        Attachments

          Activity

            People

            • Assignee:
              vinodkone Vinod Kone
              Reporter:
              bmahler Benjamin Mahler
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: