Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Currently, the steps that run before the first status update of a successful task launch is sent include filesystem image provisioning and artifact fetching. Their duration depends heavily on the task itself and not on the performance of "the infrastructure", i.e., the Mesos stack, host load, or other problems.
Ideally the scheduler would be able to set a timeout on this delay, excluding the time spent on FS provisioning and artifact fetching, so it can relaunch the task somewhere else instead of waiting indefinitely.
TASK_STARTING wouldn't work for this purpose because it's sent only after the executor is registered.
We can actually just have the agent send TASK_STAGING. Its TaskStatus.source = SOURCE_SLAVE and TaskStatus.reason = null can help the scheduler distinguish it from updates sent as a result of reconciliation. Creating a new state for this feels unnecessary?
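
To make the proposal concrete, here is a rough scheduler-side sketch of how such an update could drive a launch timeout. It is not Mesos code: the `TaskStatus` dataclass, `is_agent_staging_update`, and `LaunchTimeoutTracker` are hypothetical names, the enum values are modeled as plain strings, and the point at which the agent would emit TASK_STAGING is an assumption, since that is exactly what this ticket leaves open.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TaskStatus:
    """Hypothetical stand-in for the Mesos TaskStatus message."""
    task_id: str
    state: str                    # e.g. "TASK_STAGING", "TASK_RUNNING"
    source: str                   # e.g. "SOURCE_MASTER", "SOURCE_SLAVE"
    reason: Optional[str] = None  # None models an unset reason


def is_agent_staging_update(status: TaskStatus) -> bool:
    """True for the proposed agent-sent TASK_STAGING update.

    Reconciliation updates carry a reason (REASON_RECONCILIATION),
    so an unset reason plus SOURCE_SLAVE tells the two apart.
    """
    return (
        status.state == "TASK_STAGING"
        and status.source == "SOURCE_SLAVE"
        and status.reason is None
    )


class LaunchTimeoutTracker:
    """Starts a per-task deadline when the agent-sent TASK_STAGING arrives."""

    def __init__(self, timeout_secs: float) -> None:
        self.timeout_secs = timeout_secs
        self._deadlines: Dict[str, float] = {}

    def on_status_update(self, status: TaskStatus, now: float) -> None:
        if is_agent_staging_update(status):
            # Assumption: the clock starts at the first such update only.
            self._deadlines.setdefault(status.task_id, now + self.timeout_secs)
        elif status.state != "TASK_STAGING":
            # Any later state (TASK_STARTING, TASK_RUNNING, a terminal
            # state) means the launch is no longer pending.
            self._deadlines.pop(status.task_id, None)
        # TASK_STAGING updates from reconciliation are ignored on purpose.

    def expired(self, now: float) -> List[str]:
        """Task IDs whose deadline has passed; candidates for relaunch."""
        return sorted(t for t, d in self._deadlines.items() if now >= d)
```

A scheduler would call `on_status_update` from its status update callback and poll `expired` periodically to decide which tasks to kill and relaunch elsewhere.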