Description
In our environment, we run a lot of batch jobs, some of which have tight timelines. If any task in a job runs longer than x hours, it no longer makes sense to keep running it.
For instance, a team might submit a job that builds a weekly index and repeats every Monday. If the job does not finish before the next Monday for whatever reason, there is no point in keeping any of its tasks running.
We believe deadline tracking should be distributed across the cluster rather than handled centrally, as this makes the system more scalable and keeps our centralized state machine simpler.
One idea I have right now is to add an optional TimeInfo deadline field to TaskInfo. Once the deadline passes, all default executors in Mesos could simply terminate the task and send a proper StatusUpdate.
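
To make the idea concrete, here is a rough Python sketch of what executor-side enforcement could look like. It does not use any Mesos API: run_with_deadline is a hypothetical helper, the deadline is assumed to be an absolute wall-clock time in seconds since the epoch, and the returned (state, message) pair only stands in for a StatusUpdate. Reporting a missed deadline as TASK_FAILED is also an assumption; a dedicated state or reason might be a better fit.

```python
import subprocess
import sys
import time


def run_with_deadline(command, deadline_epoch_secs):
    """Run `command` and kill it if it is still running at the deadline.

    Hypothetical helper: returns a (state, message) pair that stands in
    for the StatusUpdate an executor would send.
    """
    remaining = deadline_epoch_secs - time.time()
    if remaining <= 0:
        # The deadline has already passed, so do not launch the task at all.
        return ("TASK_FAILED", "deadline already passed; task not launched")

    proc = subprocess.Popen(command)
    try:
        # Wait for the task, but no longer than the time left until the deadline.
        exit_code = proc.wait(timeout=remaining)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the killed process
        return ("TASK_FAILED", "deadline exceeded; task terminated")

    if exit_code == 0:
        return ("TASK_FINISHED", "")
    return ("TASK_FAILED", "exit code %d" % exit_code)
```

For example, run_with_deadline([sys.executable, "-c", "import time; time.sleep(60)"], time.time() + 1) would terminate the child after about one second and report the deadline failure. Since each executor only watches its own task, the master would not need to track any timers.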