Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.0.0
None
None
Description
The client gets stuck in the following loop if an RPC it issued to recover a block timed out:
DataStreamer#run
1. processDatanodeError
2. DN#recoverBlock
3. DN#syncBlock
4. NN#nextGenerationStamp
5. sleep 1s
6. goto 1
Once we've timed out once at step 2 and looped, step 2 throws an IOE because the block is already being recovered, and step 4 throws an IOE because the block GS is now out of date (the previous, timed-out request got a new GS and updated the block). Eventually the client reaches max retries, considers all DNs bad, and close throws an IOE.
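The failure mode above can be reproduced with a toy model. This is only a sketch, not Hadoop code: `StuckRecoveryDemo`, `Block`, `recoverBlock` and `runClient` are hypothetical names, the whole DN/NN path (steps 2 to 4) is collapsed into one method, and only the stale-GS rejection is modeled (the "already being recovered" rejection is left out).

```java
import java.io.IOException;

// Toy model of the retry loop; not Hadoop code. All names here are hypothetical.
class StuckRecoveryDemo {
    /** Stand-in for the block state held on the server side. */
    static class Block {
        long generationStamp = 1;
    }

    /** Stand-in for steps 2-4: bumps the GS, but rejects callers holding a stale GS. */
    static long recoverBlock(Block block, long clientGs) throws IOException {
        if (clientGs != block.generationStamp) {
            throw new IOException("stale generation stamp " + clientGs
                    + ", block is at " + block.generationStamp);
        }
        block.generationStamp++;
        return block.generationStamp;
    }

    /** Returns true if the client ever observes a successful recovery. */
    static boolean runClient(Block block, int maxRetries, boolean firstCallTimesOut) {
        long clientGs = block.generationStamp; // the GS the client believes is current
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                recoverBlock(block, clientGs);
                if (attempt == 0 && firstCallTimesOut) {
                    // The server finished the work, but the reply never reached the client.
                    throw new IOException("simulated RPC timeout");
                }
                return true;
            } catch (IOException e) {
                // The real client sleeps 1s here before going back to step 1.
            }
        }
        return false; // max retries reached: the client gives up
    }

    public static void main(String[] args) {
        System.out.println(runClient(new Block(), 5, false));
        System.out.println(runClient(new Block(), 5, true));
    }
}
```

In this model the timed-out first call still bumps the GS on the server side, so every retry presents the old GS and is rejected until the retry budget runs out.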
The client should be able to succeed if one of its requests to recover the block succeeded. It should still fail if another client (e.g. HBase via recoverLease, or the NN via releaseLease) successfully recovered the block. One way to handle this would be to not time out the request to recover the block. Another would be to make a subsequent call to recoverBlock succeed, e.g. by updating the block's sequence number to the latest value set by the same client in its previous request (i.e. it can recover over itself but not over another client).
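A minimal sketch of the second option, assuming the server remembers which client performed the latest GS bump and which GS that client presented. This is not the Hadoop implementation: `SelfRecoveryDemo`, `lastRecoverer`, `gsBeforeLastBump` and the client-name parameter are hypothetical and exist only to illustrate "recover over itself but not another client".

```java
import java.io.IOException;

// Sketch of the second proposal; not Hadoop code. All names here are hypothetical.
class SelfRecoveryDemo {
    static class Block {
        long generationStamp = 1;
        String lastRecoverer = null; // client that performed the latest GS bump
        long gsBeforeLastBump = -1;  // GS that client presented for that bump
    }

    static long recoverBlock(Block block, String client, long clientGs) throws IOException {
        if (clientGs == block.generationStamp) {
            // Normal case: the caller holds the current GS, so bump it.
            block.gsBeforeLastBump = clientGs;
            block.lastRecoverer = client;
            block.generationStamp++;
            return block.generationStamp;
        }
        // Stale GS: accept only a retry of the request that produced the current GS.
        if (client.equals(block.lastRecoverer) && clientGs == block.gsBeforeLastBump) {
            return block.generationStamp;
        }
        throw new IOException("stale generation stamp " + clientGs + " from " + client);
    }

    public static void main(String[] args) throws IOException {
        Block block = new Block();
        long first = recoverBlock(block, "clientA", 1); // reply assumed lost to a timeout
        long retry = recoverBlock(block, "clientA", 1); // retry with the old GS is accepted
        System.out.println(first + " " + retry);
    }
}
```

Once a different client recovers the block, `lastRecoverer` changes and a retry from the first client with its old GS is rejected again, which matches the requirement that recovery by another client must still cause a failure.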