I specifically targeted the HDFS side in this bug because the MR framework issues are actually worse. There is a very good chance that if you have multiple disks, you have swap spread across those disks, so a drive failure also costs you a chunk of swap. Loss of swap == less memory for streaming jobs == job failure in many, many instances. So let's not get distracted by the issues around MR, job failure, job speed, etc.
What I'm seeing is that at any given time we have 10-20% of our nodes down, and the vast majority of those have a single failed disk. This means we're leaving capacity on the floor while we wait for a drive replacement. Why can't these machines just stay up, continuing to serve blocks and provide space from the good drives? For large clusters this might be a minor inconvenience, but for small clusters it could be deadly.
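As a rough illustration of what I mean by "stay up on the good drives" (this is not actual DataNode code; the names checkDataDirs and failedVolumesTolerated are made up for this sketch), the behavior would be to prune a bad volume and keep running unless more volumes have failed than the operator is willing to tolerate:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hadoop's real implementation: a datanode-style
// volume check that drops failed data directories and keeps serving from the
// healthy ones, only shutting the node down when failures exceed an
// operator-set tolerance.
public class VolumeFailureSketch {

    private final List<File> dataDirs;
    private final int failedVolumesTolerated; // illustrative knob, not a real config key

    public VolumeFailureSketch(List<File> dirs, int failedVolumesTolerated) {
        this.dataDirs = new ArrayList<>(dirs);
        this.failedVolumesTolerated = failedVolumesTolerated;
    }

    // Returns true if the node should stay up after pruning bad volumes.
    public boolean checkDataDirs() {
        int failed = 0;
        List<File> healthy = new ArrayList<>();
        for (File dir : dataDirs) {
            if (dir.isDirectory() && dir.canRead() && dir.canWrite()) {
                healthy.add(dir);
            } else {
                failed++;
                System.err.println("Dropping failed volume: " + dir);
            }
        }
        dataDirs.clear();
        dataDirs.addAll(healthy);
        return failed <= failedVolumesTolerated && !dataDirs.isEmpty();
    }
}
```

With something like this, a single dead disk just shrinks the node's capacity instead of taking all of its remaining good drives offline with it.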
The current fix is done with wetware, a source of additional strain on traditionally overloaded operations teams. Random failure times vs. letting the ops team decide when a datanode goes down? This seems like a no-brainer from a practicality perspective. Yes, this is clearly more difficult than just killing the node completely. But over the long haul, it is going to be cheaper in human labor to fix this in Hadoop than to throw more admins at it.