I started looking at the patch. Unfortunately, I think the current algorithm makes assumptions about how the scheduler works. While it works perfectly well for the CapacityTaskScheduler, it may not work correctly with the FairshareScheduler, because the latter lazily removes the jobs it maintains per pool. Hence, there may be a case where the number of jobs returned by getJobsFromQueue is non-zero, but that does not mean the current job has been submitted.
I think there is already an assumption that this test is run on its own on a cluster, because it kills tasktrackers and so on, and could affect other jobs if they were run in parallel. For the same reason, jobs within the reliability test are run one after the other. So, wouldn't it be right to use jobsToComplete instead of getJobsFromQueue? As long as it returns a non-zero number of jobs, we can assume the job in question is the one most recently submitted.
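To make the suggestion concrete, here is a rough sketch of picking the most recently submitted job out of the list of incomplete jobs. It uses a stand-in `Status` class rather than the real Hadoop types so that it compiles on its own; `Status`, `mostRecent`, and the `startTime` field are hypothetical names, and the sketch assumes the test's own jobs are the only ones on the cluster.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class MostRecentJob {
    // Hypothetical stand-in for one entry of the incomplete-job list;
    // this is not the Hadoop JobStatus class.
    static final class Status {
        final String jobId;
        final long startTime; // assumed: start time in milliseconds

        Status(String jobId, long startTime) {
            this.jobId = jobId;
            this.startTime = startTime;
        }
    }

    // Returns the entry with the latest start time, or empty if no job is incomplete.
    static Optional<Status> mostRecent(List<Status> incomplete) {
        return incomplete.stream()
                .max(Comparator.comparingLong((Status s) -> s.startTime));
    }

    public static void main(String[] args) {
        // Made-up job ids and timestamps, for illustration only.
        List<Status> incomplete = Arrays.asList(
                new Status("job_a", 100L),
                new Status("job_b", 200L));
        System.out.println(mostRecent(incomplete).get().jobId); // prints "job_b"
    }
}
```

Since the reliability test runs its jobs one after the other, the list would normally hold at most one entry; taking the latest start time is only a guard in case an earlier job has not finished yet.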
Some other minor points:
- Can we update the documentation to say how the reliability test should be run? For instance, it has to be run on a cluster that is not running other jobs, as stated above.
- Also, I would suggest we fail noisily if the last job we get is not in the PREP or RUNNING state, so that we do not get false-positive runs of the MRReliability test.
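
For the last point, something along these lines is what I have in mind. This is only a sketch: the `State` enum and `requireActive` are hypothetical stand-ins so the example is self-contained, and the real check would read the run state from the job status the test already has.

```java
import java.util.EnumSet;
import java.util.Set;

class JobStateCheck {
    // Hypothetical stand-in for the job run states; not the Hadoop constants.
    enum State { PREP, RUNNING, SUCCEEDED, FAILED, KILLED }

    // States in which the test can still act on the job.
    static final Set<State> ACTIVE = EnumSet.of(State.PREP, State.RUNNING);

    // Throws instead of silently continuing, so a bad run cannot pass unnoticed.
    static void requireActive(String jobId, State state) {
        if (!ACTIVE.contains(state)) {
            throw new IllegalStateException(
                    "Job " + jobId + " is in state " + state + "; expected PREP or RUNNING");
        }
    }

    public static void main(String[] args) {
        requireActive("job_a", State.PREP); // passes silently
        try {
            requireActive("job_b", State.SUCCEEDED);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Throwing here, rather than logging and moving on, means a run in which the job had already finished shows up as a test failure.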