@Dhruba: I agree that this is not a blocker for 0.19. Out-of-phase thread deaths don't typically occur in real deployments, and we haven't yet observed this condition occurring frequently on our grids.
However, I think there are real deficiencies in error recovery for HDFS writes.
- the client does not correctly detect which link in the write pipeline failed
- the client tries to initiate block recovery from the dead Datanode, fails to do so, and causes the write to fail. This is mostly a consequence of the first problem, but it can also occur if the recovery primary fails following a link failure.
Ideally, a writer should fail only if
- the writer itself dies for some reason
- the writer loses all its replicas
This should be the subject of a different JIRA but I think we should spend some energy making it happen. For this issue, the best course might be to disable testSimple until we have a complete recovery story.
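To make the "fail only if" rule above concrete, here is a rough sketch of the behaviour I have in mind. It is not the actual client code: the class and method names (PipelineRecoverySketch, handleFailure, chooseRecoveryPrimary) are made up for illustration, and Datanodes are represented as plain strings. It assumes the client has already identified the failed link correctly, which is exactly what does not work today.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names exist in the HDFS client.
final class PipelineRecoverySketch {
    // Datanodes still believed to hold a good replica, in pipeline order.
    private final List<String> pipeline;

    PipelineRecoverySketch(List<String> datanodes) {
        this.pipeline = new ArrayList<>(datanodes);
    }

    // Drops the node identified as failed and reports whether the write
    // can continue. The write fails only when no replica is left.
    boolean handleFailure(String failedNode) {
        pipeline.remove(failedNode);
        return !pipeline.isEmpty();
    }

    // Picks the recovery primary from the surviving nodes only, so recovery
    // is never initiated from a Datanode already known to be dead.
    String chooseRecoveryPrimary() {
        if (pipeline.isEmpty()) {
            throw new IllegalStateException("all replicas lost");
        }
        return pipeline.get(0);
    }

    // Returns a copy of the surviving nodes.
    List<String> survivors() {
        return new ArrayList<>(pipeline);
    }
}
```

If the chosen primary itself dies during recovery, the caller would treat that as one more link failure: call handleFailure on the primary and choose again. The writer gives up only when handleFailure returns false.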