To support fault tolerance, we would like to use a state machine to control the system's state transitions.
After the driver is created, it starts in the state of requesting evaluators and submitting contexts. Once all the contexts are ready, it moves to the submitting tasks state. When all the tasks have started running, it moves to the tasks running state. When all the tasks have completed, it moves to the tasks completed state. If a task or an evaluator fails, it moves to the shut down state, and so on.
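The transitions described above can be sketched as a small table-driven state machine. This is only an illustration under stated assumptions: the state and event names, the `DriverStateMachine` class, and the rule that a failure leads to shut down from any state are all hypothetical and are not taken from the proposal's actual state and event lists.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical names; the proposal's actual state list may differ.
    WAITING_FOR_EVALUATORS = auto()  # requesting evaluators, submitting contexts
    SUBMITTING_TASKS = auto()
    TASKS_RUNNING = auto()
    TASKS_COMPLETED = auto()
    SHUTTING_DOWN = auto()


class Event(Enum):
    # Hypothetical names; the proposal's actual event list may differ.
    ALL_CONTEXTS_READY = auto()
    ALL_TASKS_RUNNING = auto()
    ALL_TASKS_COMPLETED = auto()
    TASK_FAILED = auto()
    EVALUATOR_FAILED = auto()


# Normal-path transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.WAITING_FOR_EVALUATORS, Event.ALL_CONTEXTS_READY): State.SUBMITTING_TASKS,
    (State.SUBMITTING_TASKS, Event.ALL_TASKS_RUNNING): State.TASKS_RUNNING,
    (State.TASKS_RUNNING, Event.ALL_TASKS_COMPLETED): State.TASKS_COMPLETED,
}

FAILURE_EVENTS = {Event.TASK_FAILED, Event.EVALUATOR_FAILED}


class DriverStateMachine:
    """Tracks the driver state and rejects transitions not in the table."""

    def __init__(self):
        self.state = State.WAITING_FOR_EVALUATORS

    def handle(self, event):
        # Assumption: a task or evaluator failure leads to shut down
        # regardless of the current state.
        if event in FAILURE_EVENTS:
            self.state = State.SHUTTING_DOWN
            return self.state
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {event.name} in {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state
```

Keeping the legal transitions in a single table makes it easy to check the design against the lists of states and events, and an event that arrives in the wrong state fails loudly instead of silently corrupting the driver state.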
Here are the proposed system states:
Here are the events that may trigger a state change: