Description
When running some stress tests, I saw the DURefreshThread fail because the filesystem changed underneath it:
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/data/4/dfs/dn/current/BP-1928785663-172.20.90.20-1343880685858/current/rbw/blk_4637779214690837894': No such file or directory
(the block was probably finalized while the du process was running, which caused it to fail)
The next block write then called getUsed(), and the exception was propagated, causing the write to fail. Since this was a pseudo-distributed cluster, the client was unable to pick a different node to write to and failed.
The current behavior of propagating the exception to the next caller (and only the next caller) does not seem well thought out.
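One alternative would be to treat a failed refresh as transient and keep serving the last successful measurement, only propagating an error when no value has ever been obtained. A minimal sketch of that idea (hypothetical class and interface names, not the actual Hadoop DU implementation):

```java
// Hypothetical sketch: a refresh thread updates the cached value; a
// transient failure (e.g. "du: cannot access ...: No such file or
// directory" because a block was finalized mid-scan) leaves the last
// good value in place instead of failing the next getUsed() caller.
public class CachingDiskUsage {
    private volatile long lastUsed;       // last successful measurement
    private volatile boolean haveValue;   // true once any refresh succeeded

    /** Hypothetical probe abstraction standing in for the du shell call. */
    interface DiskProbe {
        long used() throws Exception;
    }

    void refresh(DiskProbe probe) {
        try {
            lastUsed = probe.used();
            haveValue = true;
        } catch (Exception e) {
            if (!haveValue) {
                // No prior value to fall back on; surface the error.
                throw new RuntimeException("initial du failed", e);
            }
            // Otherwise: log and keep the stale-but-usable value.
        }
    }

    long getUsed() {
        return lastUsed;
    }
}
```

Under this scheme a mid-scan failure costs only staleness until the next successful refresh, rather than failing an unrelated block write.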