If there is a total network partition, then we don't have a problem: either the cluster will fail outright (say the JT and NN end up on different sides of the partition), or one partition (the one that has the JT/NN) will exclude nodes from the other. (I say we don't have a problem in the sense that Hadoop's response to such an event is more or less correct.)
The problem is that we have had occurrences of slow networks that are not quite partitioned. For example, the uplink from one rack switch to the core switch can be flaky/degraded. In this case, control traffic from the JT to the TTs may be getting through, but data traffic to and from mappers and reducers on the degraded rack can be really hurt. If there are problems in the core switch itself (it's underprovisioned), then the whole cluster is having network problems. The description applies to such scenarios.
In such a case, the appropriate response of the software should be, at worst, degraded performance (in keeping with the degraded state of the underlying hardware), or at best, correctly identifying the slow node(s) and not using them, or using them less (this would apply to the flaky rack-uplink scenario). The current response of Hadoop is neither. It makes a bad situation worse by misassigning blame: when a sufficiently large number of reducers running on bad racks report failures, map nodes on good racks get blamed. We potentially lose nodes from good racks, and the resulting retry of tasks puts further stress on the already strained network.
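To make the blame-misassignment concrete, here is a deliberately simplified model of the failure-report accounting described above. It is not actual Hadoop code; the threshold and the report format are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical, simplified model (NOT actual Hadoop code): each reducer
# that fails to fetch a map output files a report against the mapper's
# node; once a node accumulates enough reports, the JT stops using it.

FAILURE_THRESHOLD = 3  # assumed: reports needed before a map node is shunned

def assign_blame(fetch_failures):
    """fetch_failures: list of (reducer_node, mapper_node) failure reports.
    Returns the set of mapper nodes that would be shunned."""
    reports = Counter(mapper for _, mapper in fetch_failures)
    return {node for node, n in reports.items() if n >= FAILURE_THRESHOLD}

# Three reducers sit on a rack with a degraded uplink; every fetch they
# attempt from a perfectly healthy mapper on a good rack times out.
failures = [("bad-rack-r%d" % i, "good-rack-m1") for i in range(3)]
print(assign_blame(failures))  # blames the healthy node: {'good-rack-m1'}
```

The point of the sketch: the counting is done per accused node, with no reference to who the accusers are, so a handful of reducers behind one bad uplink can get a good node taken out of service.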
A couple of things seem desirable:
1. For enterprise data center environments that (may) have a high degree of control and monitoring around their networking elements: the ability to selectively turn off the functionality in Hadoop that tries to detect and correct for network problems. External diagnostics stand a much better chance of catching/identifying networking problems and fixing them.
2. In environments with less control (say Amazon EC2, or Hadoop running on a bunch of PCs across a company) that are more akin to a p2p network: Hadoop's network fault diagnosis algorithms need improvement. A comparison to BitTorrent is fair: there, every node advertises its upload/download throughput, and a node can come across as slow only in comparison to the collective stats published by all peers (not just based on communication with a small set of other peers).
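The collective-stats idea in item 2 can be sketched as follows. This is not BitTorrent's actual protocol, just an illustration of the principle: a node is judged slow only relative to a statistic over the whole population, never from one unhappy peer's vantage point. The threshold fraction is an assumption.

```python
import statistics

# Sketch of population-relative slowness detection (illustrative only):
# every node advertises its observed throughput, and a node counts as
# slow only if it falls well below the cluster-wide median.

SLOW_FRACTION = 0.5  # assumed: "slow" = below half the collective median

def find_slow_nodes(advertised_mbps):
    """advertised_mbps: dict of node -> self-reported throughput (MB/s).
    Returns the set of nodes that are slow relative to the population."""
    median = statistics.median(advertised_mbps.values())
    return {n for n, t in advertised_mbps.items() if t < SLOW_FRACTION * median}

stats = {"m1": 95.0, "m2": 100.0, "m3": 110.0, "bad-rack-m4": 12.0}
print(find_slow_nodes(stats))  # only the genuine outlier: {'bad-rack-m4'}
```

Contrast this with the current behavior: here the degraded node is identified by comparing it against everyone's published numbers, so a reducer on a bad rack cannot, on its own, make a good node look slow.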