Aurora / AURORA-1966

TASK_UNKNOWN to PARTITIONED mapping causes the Scheduler to attempt to KILL a non-existent Task indefinitely


Details

    • Type: Bug
    • Status: Reviewable
    • Priority: Major
    • Resolution: Unresolved

    Description

      When a Task launch fails, it is moved from ASSIGNED to LOST, which performs a RESCHEDULE and a KILL. Unfortunately, a KILL sent to the Mesos master for a non-existent task results in a TASK_UNKNOWN status update, which gets mapped to PARTITIONED. Although the transition from LOST to PARTITIONED is not allowed, some callbacks are executed regardless, resulting in KILL and RESCHEDULE actions. This new KILL triggers another TASK_UNKNOWN, and hence another PARTITIONED status update for the same task, leaving the Scheduler attempting to KILL the non-existent task indefinitely. Attempting a client job killall puts the scheduler in the same state.
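
      To make the cycle concrete, here is a minimal, self-contained Java model of the feedback loop; every name in it is a hypothetical stand-in, not Aurora's actual code:

        enum State { LOST, PARTITIONED }

        class KillLoopDemo {
          static int kills = 0;

          // Stand-in for the scheduler's status-update handler: TASK_UNKNOWN has
          // already been mapped to PARTITIONED by the time the update arrives here.
          static void onStatusUpdate(State current, State proposed) {
            System.out.println("Callback transition " + current + " to " + proposed + ", allow: false");
            // The bug: the KILL side effect fires even though the
            // LOST -> PARTITIONED transition itself was rejected.
            killTask();
          }

          // Stand-in for the driver kill call; Mesos answers TASK_UNKNOWN for a
          // task it no longer knows about, which maps straight back to PARTITIONED.
          static void killTask() {
            kills++;
            if (kills < 5) { // capped for the demo; the real scheduler never stops
              onStatusUpdate(State.LOST, State.PARTITIONED);
            }
          }

          public static void main(String[] args) {
            killTask(); // the initial KILL of the already-LOST task starts the cycle
            System.out.println("KILL attempts: " + kills);
          }
        }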

      Since the scheduler uses the LOST state for black-holing tasks, the TaskStateMachine needs to take those into account.
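
      One possible shape for that change, as a minimal sketch: the ScheduleStatus values below mirror Aurora's, but the guard itself and its names are illustrative, not the actual TaskStateMachine API.

        enum ScheduleStatus { LOST, PARTITIONED /* ... */ }

        final class TransitionGuard {
          // Decide whether a proposed transition should produce side effects
          // (KILL, RESCHEDULE) at all.
          static boolean shouldActOn(ScheduleStatus current, ScheduleStatus proposed) {
            // LOST is used to black-hole tasks: any later report from Mesos
            // (e.g. TASK_UNKNOWN mapped to PARTITIONED) refers to a task the
            // scheduler has already written off, so it should be swallowed
            // instead of producing another KILL that bounces back as TASK_UNKNOWN.
            return !(current == ScheduleStatus.LOST
                && proposed == ScheduleStatus.PARTITIONED);
          }
        }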

      I was able to reproduce this in the Vagrant image by faking a launch failure.
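
      The fake failure shows up in the "Caused by" frame of the trace below: an IllegalStateException thrown from inside OfferManagerImpl.launchTask. A standalone sketch of the same trick (class and method here are hypothetical; only the exception message matches the trace):

        class FakeLaunchFailureDemo {
          // Stand-in for the patched launch path; in the repro the equivalent
          // throw was placed inside Aurora's OfferManagerImpl.launchTask.
          static void launchTask(String taskId) {
            // Fail before the task ever reaches the Mesos master, forcing the
            // ASSIGNED -> LOST transition (and the follow-up KILL) seen below.
            throw new IllegalStateException("mocking launch failure");
          }

          public static void main(String[] args) {
            try {
              launchTask("vagrant-test-fail-partition_aware_disabled-0");
            } catch (IllegalStateException e) {
              System.out.println("Failed to launch task: " + e.getMessage());
            }
          }
        }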

      I0124 05:48:23.198 [qtp1791010542-40, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition INIT -> PENDING
      I0124 05:48:23.213508 9748 log.cpp:560] Attempting to append 1679 bytes to the log
      I0124 05:48:23.214570 9748 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 24778
      I0124 05:48:23.214834 9748 replica.cpp:540] Replica received write request for position 24778 from __req_res__(4)@192.168.33.7:8083
      I0124 05:48:23.221982 9748 leveldb.cpp:341] Persisting action (1700 bytes) to leveldb took 6.772102ms
      I0124 05:48:23.222174 9748 replica.cpp:711] Persisted action APPEND at position 24778
      I0124 05:48:23.222901 9748 replica.cpp:694] Replica received learned notice for position 24778 from log-network(1)@192.168.33.7:8083
      I0124 05:48:23.226833 9748 leveldb.cpp:341] Persisting action (1702 bytes) to leveldb took 3.227779ms
      I0124 05:48:23.227008 9748 replica.cpp:711] Persisted action APPEND at position 24778
      I0124 05:48:23.262 [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +0000] "POST //aurora.local/api HTTP/1.1" 200 78
      I0124 05:48:23.267 [qtp1791010542-40, LoggingInterceptor] getTasksWithoutConfigs(TaskQuery(role:null, environment:null, jobName:null, taskIds:null, statuses:null, instanceIds:null, slaveHosts:null, jobKeys:[JobKey(role:vagrant, environment:test, name:fail-partition_aware_disabled)], offset:0, limit:0))
      I0124 05:48:23.285 [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +0000] "POST //aurora.local/api HTTP/1.1" 200 794
      I0124 05:48:23.349 [TaskGroupBatchWorker, StateMachine] Callback transition PENDING to ASSIGNED, allow: true
      I0124 05:48:23.353 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition PENDING -> ASSIGNED
      I0124 05:48:23.356 [TaskGroupBatchWorker, TaskAssignerImpl] Offer on agent 192.168.33.7 (id fe8bc641-aa02-4363-a990-318d20de1bac-S0) is being assigned task for vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606.
      W0124 05:48:23.445 [TaskGroupBatchWorker, TaskAssignerImpl] Failed to launch task.
      org.apache.aurora.scheduler.offers.OfferManager$LaunchException: Failed to launch task.
          at org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:212)
          at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
          at org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.launchUsingOffer(TaskAssignerImpl.java:126)
          at org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.maybeAssign(TaskAssignerImpl.java:262)
          at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
          at org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.scheduleTasks(TaskSchedulerImpl.java:154)
          at org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.schedule(TaskSchedulerImpl.java:108)
          at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
          at org.apache.aurora.scheduler.scheduling.TaskGroups$1.lambda$run$0(TaskGroups.java:174)
          at org.apache.aurora.scheduler.BatchWorker$Work.apply(BatchWorker.java:117)
          at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
          at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
          at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
          at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
          at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
          at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
          at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
          at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
          at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
          at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
          at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
          at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
          at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
          at java.lang.Thread.run(Thread.java:748)
      Caused by: java.lang.IllegalStateException: mocking launch failure
          at org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:201)
          ... 23 common frames omitted
      I0124 05:48:23.446 [TaskGroupBatchWorker, StateMachine] Callback transition ASSIGNED to LOST, allow: true
      I0124 05:48:23.447 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition ASSIGNED -> LOST
      I0124 05:48:23.449 [TaskGroupBatchWorker, StateManagerImpl] Task being rescheduled: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
      I0124 05:48:23.454 [TaskGroupBatchWorker, StateMachine] Callback transition INIT to PENDING, allow: true
      I0124 05:48:23.456 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-a882eee4-b214-49e1-be9b-1b60d119ed60 state machine transition INIT -> PENDING
      I0124 05:48:23.473 [TaskGroupBatchWorker, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
      I0124 05:48:23.476 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
      I0124 05:48:23.496659 9751 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
      I0124 05:48:23.499577 9749 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
      W0124 05:48:23.505 [AsyncProcessor-4, TaskVars] Failed to find rack attribute associated with host 192.168.33.7
      W0124 05:48:23.515 [AsyncProcessor-4, Stats] Re-using already registered variable for key task_delivery_delay_SOURCE_MASTER_timeouts_per_sec
      W0124 05:48:23.515 [AsyncProcessor-4, Stats] Re-using already registered variable for key task_delivery_delay_SOURCE_MASTER_requests_per_sec
      I0124 05:48:23.524458 9754 log.cpp:560] Attempting to append 3625 bytes to the log
      I0124 05:48:23.524636 9754 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 24779
      I0124 05:48:23.525111 9754 replica.cpp:540] Replica received write request for position 24779 from __req_res__(5)@192.168.33.7:8083
      I0124 05:48:23.526832 9754 leveldb.cpp:341] Persisting action (3646 bytes) to leveldb took 1.612374ms
      I0124 05:48:23.526927 9754 replica.cpp:711] Persisted action APPEND at position 24779
      I0124 05:48:23.527189 9754 replica.cpp:694] Replica received learned notice for position 24779 from log-network(1)@192.168.33.7:8083
      I0124 05:48:23.528949 9754 leveldb.cpp:341] Persisting action (3648 bytes) to leveldb took 1.543755ms
      I0124 05:48:23.529088 9754 replica.cpp:711] Persisted action APPEND at position 24779
      I0124 05:48:23.538 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
      I0124 05:48:23.540 [TaskStatusHandlerImpl, StateMachine] Callback transition LOST to PARTITIONED, allow: false
      I0124 05:48:23.540 [TaskStatusHandlerImpl, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition LOST -> PARTITIONED (not allowed)
      I0124 05:48:23.540 [TaskStatusHandlerImpl, TaskStateMachine] LOST to PARTITIONED => KILL
      E0124 05:48:23.540 [TaskStatusHandlerImpl, TaskStateMachine] Illegal state transition attempted: LOST -> PARTITIONED (not allowed)
      I0124 05:48:23.541 [TaskStatusHandlerImpl, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
      I0124 05:48:23.542 [TaskStatusHandlerImpl, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
      I0124 05:48:23.543357 9749 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
      W0124 05:48:23.546 [TaskGroupBatchWorker, TaskSchedulerImpl] Failed to look up tasks vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
      I0124 05:48:23.587267 9751 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
      I0124 05:48:23.593 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
      I0124 05:48:23.595 [TaskStatusHandlerImpl, StateMachine] Callback transition LOST to PARTITIONED, allow: false
      I0124 05:48:23.595 [TaskStatusHandlerImpl, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition LOST -> PARTITIONED (not allowed)
      I0124 05:48:23.595 [TaskStatusHandlerImpl, TaskStateMachine] LOST to PARTITIONED => KILL
      E0124 05:48:23.595 [TaskStatusHandlerImpl, TaskStateMachine] Illegal state transition attempted: LOST -> PARTITIONED (not allowed)
      I0124 05:48:23.595 [TaskStatusHandlerImpl, StateManagerImpl] Task being killed: vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
      I0124 05:48:23.595 [TaskStatusHandlerImpl, VersionedSchedulerDriverService] Killing task vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606
      I0124 05:48:23.596256 9747 scheduler.cpp:243] Sending KILL call to http://192.168.33.7:5050/master/api/v1/scheduler
      I0124 05:48:23.634858 9751 scheduler.cpp:676] Enqueuing event UPDATE received from http://192.168.33.7:5050/master/api/v1/scheduler
      I0124 05:48:23.636 [TaskStatusHandlerImpl, TaskStatusHandlerImpl] Task:vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 moved to status:PARTITIONED for mesos state: TASK_UNKNOWN
      


          People

            Assignee: David McLaughlin (davmclau)
            Reporter: Santhosh Kumar Shanmugham (santhk)
            Votes: 0
            Watchers: 3
