Nigel - This patch proved very hard to test without mock objects. For now, I've attached a slightly arbitrary test-case which does the following:
- Simulates a very large cluster by setting a very high value of 30s for the heartbeat-interval between the JobTracker and TaskTracker.
- Switches on the out-of-band heartbeat for the cluster.
- Submits a very small random-writer job with 2 maps and asserts that the job completes within 120s.
The 120s deadline is chosen on the idea that a randomwriter job with 2 maps will need at least 4 heartbeats: setup-task, map_0, map_1 and cleanup-task. However, this is still arbitrary and not very scientific. So, should we commit this test-case given that it may be slightly flaky? Thoughts?
PS: FYI, the job completes in ~50s with out-of-band heartbeats turned on, and in ~3 mins with them turned off.
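
For reference, the arithmetic behind the 120s deadline can be sketched as below. This is only an illustration, not code from the patch: the class and method names are made up, and it assumes each of the 4 scheduling steps waits one full 30s heartbeat when out-of-band heartbeats are off. The ~3 mins figure is taken as 180s.

```java
public class HeartbeatDeadline {
    // Lower bound on job time if every scheduling step (setup-task, map_0,
    // map_1, cleanup-task) has to wait for one regular heartbeat.
    // Hypothetical helper, not part of the patch.
    static long regularHeartbeatLowerBoundSeconds(long intervalSeconds, int heartbeatsNeeded) {
        return intervalSeconds * heartbeatsNeeded;
    }

    // True if the deadline lies strictly between the observed fast run
    // (out-of-band heartbeats on) and the observed slow run (off).
    static boolean separates(long deadlineSeconds, long fastSeconds, long slowSeconds) {
        return fastSeconds < deadlineSeconds && deadlineSeconds < slowSeconds;
    }

    public static void main(String[] args) {
        long bound = regularHeartbeatLowerBoundSeconds(30, 4);
        System.out.println("lower bound with regular heartbeats: " + bound + "s");
        // Observed timings from the runs above: ~50s (on) and ~180s (off).
        System.out.println("120s separates the two runs: " + separates(120, 50, 180));
    }
}
```

So the deadline equals the 4 x 30s lower bound for regular heartbeats and sits between the two observed timings, which is why the assertion should only pass when out-of-band heartbeats are working. The margin on either side is what makes it timing-sensitive.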