This patch changes the AM to restart a map task if 50% of the shuffling reducers report errors for a given map task instead of 50% of the running reducers.
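As a rough illustration of the restart rule, the check might look like the sketch below. The class, method, and constant names are hypothetical and are not taken from the patch. The sketch also assumes that the 50% figure is computed as failure reports divided by the number of shuffling reducers, and that the rule is combined with the minimum of three reports that the AM requires.

```java
// Hypothetical sketch of the AM-side decision; names are illustrative only.
final class FetchFailurePolicy {
    // The AM requires at least three reports before restarting a map task.
    static final int MIN_FAILURE_REPORTS = 3;
    // Fraction of *shuffling* reducers that must have reported errors.
    static final double MAX_FAILURE_FRACTION = 0.5;

    // Assumption: the fraction is failure reports / shuffling reducers.
    static boolean shouldRestartMap(int failureReports, int shufflingReducers) {
        if (shufflingReducers <= 0) {
            return false; // nobody is shuffling, so nothing can be concluded
        }
        double fraction = (double) failureReports / shufflingReducers;
        return failureReports >= MIN_FAILURE_REPORTS
                && fraction >= MAX_FAILURE_FRACTION;
    }

    public static void main(String[] args) {
        // One shuffling reducer that has sent three reports.
        System.out.println(shouldRestartMap(3, 1));
        // Ten shuffling reducers, only three reports so far.
        System.out.println(shouldRestartMap(3, 10));
    }
}
```

Dividing by the shuffling reducers rather than all running reducers means that reducers which have not started fetching yet do not dilute the failure fraction.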
It also changes how often a reducer reports fetch failures. If a ConnectionException occurs, the reducer reports the error immediately instead of waiting. A ConnectionException indicates that nothing is listening on the remote port, which is very different from a timeout, where the port is overrun and no one is able to get through.
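A minimal sketch of that reporting decision follows. It assumes the exception in question is `java.net.ConnectException` and that timeouts keep the existing behaviour of reporting only every 10th failure. The class and method names are hypothetical, not the ones used in the patch.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Hypothetical sketch of the reducer-side reporting decision.
final class FetchFailureReporting {
    // Other failures are only reported on every 10th occurrence.
    static final int REPORT_EVERY_N_FAILURES = 10;

    // failureCount is the number of failures seen so far for this map output,
    // including the current one (so it starts at 1).
    static boolean shouldReport(IOException cause, int failureCount) {
        if (cause instanceof ConnectException) {
            // Nothing is listening on the remote port: report right away.
            return true;
        }
        return failureCount > 0 && failureCount % REPORT_EVERY_N_FAILURES == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldReport(new ConnectException("refused"), 1));
        System.out.println(shouldReport(new SocketTimeoutException("timed out"), 1));
        System.out.println(shouldReport(new SocketTimeoutException("timed out"), 10));
    }
}
```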
It also adds a maximum delay between fetch retries. In the original code, every fetch failure increased the retry delay by 30%, and the reducer only reported every 10th failure. This means that the first failure would be reported after about 6 min, the second after 90 min, and the third after 20 hours. This is really bad when there is only one reducer, because the AM requires at least three reports before it restarts the map task.
The default maximum delay is set to 1 min, which changes those numbers to 6 min, 15 min, and 25 min respectively. 25 min still seems very long to wait, but it is much better than 20 hours.
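The effect of the cap can be explored with a small model of the backoff. This is only a sketch: the 10-second initial delay is an assumption (the description does not state it), the names are hypothetical, and the totals it produces depend on that assumption, so they will not necessarily match the figures quoted above.

```java
// Hypothetical model of the fetch retry backoff; not code from the patch.
final class FetchBackoff {
    static final long INITIAL_DELAY_MS = 10_000L; // assumption, not from the description
    static final double GROWTH = 1.3;             // delay grows by 30% per failure
    static final int REPORT_EVERY = 10;           // only every 10th failure is reported

    // Delay after the given failure (1-based), limited to maxDelayMs.
    static long delayMs(int failure, long maxDelayMs) {
        double uncapped = INITIAL_DELAY_MS * Math.pow(GROWTH, failure - 1);
        return (long) Math.min(uncapped, (double) maxDelayMs);
    }

    // Rough time until the n-th report: the sum of the first 10 * n delays.
    static long msUntilReport(int reportNumber, long maxDelayMs) {
        long total = 0;
        for (int failure = 1; failure <= reportNumber * REPORT_EVERY; failure++) {
            total += delayMs(failure, maxDelayMs);
        }
        return total;
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        for (int report = 1; report <= 3; report++) {
            System.out.printf("report %d: uncapped %d min, capped %d min%n",
                    report,
                    msUntilReport(report, Long.MAX_VALUE) / 60_000L,
                    msUntilReport(report, oneMinute) / 60_000L);
        }
    }
}
```

Once the delay has reached the 1 min cap, ten further retries take 10 x 1 min = 10 min, which is consistent with the roughly 10-minute spacing between the capped figures quoted above.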