I think this approach might work OK for now. It makes sure the datanode is not marked dead, but it should be considered mostly a workaround. We should note that the fundamental problem still remains, though it is a little less lethal: e.g. (a) new blocks are not reported, (b) no new blocks can be written during this time, (c) (not sure) no blocks can be read, etc.
If all the nodes take very long to process the block report, many operations on HDFS will fail. An admin can increase the block report period to reduce the effect of this problem. The current fix works fine for occasional delays.
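For an admin who wants to try a longer block report period, a minimal sketch of the override follows. It assumes the period is controlled by the `dfs.blockreport.intervalMsec` property (in milliseconds) in the site configuration file. The value shown is only an illustration, and the property name and its default should be checked against the release actually deployed.

```xml
<!-- Assumption: dfs.blockreport.intervalMsec sets the block report period in ms. -->
<property>
  <name>dfs.blockreport.intervalMsec</name>
  <!-- Illustrative value only: 2 hours = 2 * 60 * 60 * 1000 ms -->
  <value>7200000</value>
</property>
```

A longer period should make slow block report processing less frequent, but it is a trade-off: the namenode would likely take longer to reconcile its view with what each datanode actually stores.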
> In step 4. should we wait for receiving a command or for receiving another block?
Both would be better.
> In OfferService we process all the commands that are in the queue at once. Do you see any issues with it?
Not fundamentally different. One main issue would be that there might sometimes be thousands of blocks to delete, but that is the same problem as a long block report.
Regarding a more complete fix, I could file another JIRA to propose a fix that I discussed with Sameer and Hairong, one that satisfies all the requirements on the current block report.