We've seen a burst of commitBlockSynchronization() calls make the namenode unresponsive for a long time, causing other important RPC calls such as lease renewal and heartbeats to fail. Since the blocks are copied, it can also create a lot of cluster-wide traffic.
The commitBlockSynchronization() method logs two messages, one at the beginning after acquiring the write lock and another after releasing the lock and syncing the edit log. The time between the two is usually less than 1-2 ms, so the actual processing and sync times don't seem long. But when the namenode gets a burst of these calls, it can only sustain a rate of 20-30 per second, with almost no other requests being served. When these calls are served back-to-back, the gap between calls ranges from 20 to 100 ms.
The calls are presumably blocked at the write lock. Although enabling fairness is known to cause significant performance degradation on a write-heavy ReadWriteLock (about 80% degradation with 100 threads in my experiment), that overhead is still very small compared to the 20-100 ms wait time we saw.
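For anyone who wants to repeat the fairness comparison, here is a rough sketch of the kind of micro-benchmark I mean. It is not the code I ran: the class name FairLockBench and its run() helper are made up for illustration, and it only uses java.util.concurrent.locks.ReentrantReadWriteLock with its fairness flag. Timings will vary a lot by machine and JVM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical micro-benchmark: write-lock contention with and without fairness. */
final class FairLockBench {

    /** Returns {elapsedNanos, totalIncrements} for one run. */
    static long[] run(boolean fair, int threads, int opsPerThread) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(fair);
        long[] counter = {0};                       // guarded by the write lock
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();                  // release all workers together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int j = 0; j < opsPerThread; j++) {
                    lock.writeLock().lock();
                    try {
                        counter[0]++;               // tiny critical section
                    } finally {
                        lock.writeLock().unlock();
                    }
                }
            });
            workers[i].start();
        }

        long t0 = System.nanoTime();
        start.countDown();
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        long elapsed = System.nanoTime() - t0;
        return new long[] {elapsed, counter[0]};
    }
}
```

Comparing the first element of run(false, 100, n) and run(true, 100, n) for some large n gives a rough idea of the cost of fairness under pure write contention.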
Regardless of the performance and efficiency of commitBlockSynchronization(), I think it is reasonable to throttle block recovery so that the namenode can avoid shooting itself in the foot. It would be nice to have feedback-based, dynamic, asynchronous work scheduling, but simple throttling may do for now. I propose a configurable rate with 300/min as the default.
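To make the proposal concrete, here is a minimal sketch of what simple throttling could look like, assuming a token-bucket style limiter. Everything in it is hypothetical: the class name RecoveryThrottle, the burst parameter, and the injectable clock are my own illustration and not existing HDFS code. Where the namenode would consult it when scheduling block recoveries is left open.

```java
import java.util.function.LongSupplier;

/** Hypothetical token-bucket throttle for scheduling block recoveries. */
final class RecoveryThrottle {
    private final long nanosPerPermit;   // e.g. 200 ms per permit at 300/min
    private final long maxCreditNanos;   // caps how many permits can accumulate
    private final LongSupplier nanoClock;
    private long creditNanos;
    private long lastNanos;

    RecoveryThrottle(int permitsPerMinute, int burst, LongSupplier nanoClock) {
        if (permitsPerMinute <= 0 || burst <= 0) {
            throw new IllegalArgumentException("rate and burst must be positive");
        }
        this.nanosPerPermit = 60_000_000_000L / permitsPerMinute;
        this.maxCreditNanos = nanosPerPermit * burst;
        this.nanoClock = nanoClock;
        this.creditNanos = maxCreditNanos;   // start with a full bucket
        this.lastNanos = nanoClock.getAsLong();
    }

    /** Returns true if one more recovery may be scheduled now. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        long elapsed = now - lastNanos;
        if (elapsed > 0) {
            // Accrue credit for elapsed time, never beyond the burst cap.
            creditNanos = Math.min(maxCreditNanos, creditNanos + elapsed);
            lastNanos = now;
        }
        if (creditNanos >= nanosPerPermit) {
            creditNanos -= nanosPerPermit;
            return true;
        }
        return false;                        // caller should defer this recovery
    }
}
```

With the proposed default, new RecoveryThrottle(300, 5, System::nanoTime) would allow a short burst of 5 and then one recovery every 200 ms. A rejected recovery would simply stay queued and be retried on the next scheduling pass.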