The current testing framework for the scheduler only tests individual classes in isolation: DAGSchedulerSuite, TaskSchedulerImplSuite, etc. That is useful, but we are missing tests that cover the interaction between these components. We also have larger tests that run entire Spark jobs, but those don't allow fine-grained control of failures for verifying Spark's fault tolerance.
Adding a framework for testing the scheduler as a whole will:
1. Allow testing of bugs that involve the interaction between multiple parts of the scheduler, e.g. SPARK-10370.
2. Give greater confidence when refactoring the scheduler as a whole. Given the tight coordination between the components, it's hard to consider any refactoring, since the changes would be unlikely to be covered by any tests.
3. Make it easier to increase test coverage. Writing tests for the DAGScheduler currently requires intimate knowledge of exactly how the components fit together, and a lot of work goes into mimicking the appropriate behavior of the other components. This also makes the tests harder to understand for the uninitiated: which parts are simulating some condition of an external system (e.g., losing an executor), and which parts are just interaction with other parts of the scheduler (e.g., task resubmission)? Tests written with the new framework will only need to work at the level of the interaction with the executors: tasks complete, tasks fail, executors are lost, etc.