Hadoop HDFS / HDFS-2639

A client may fail during block recovery even if its request to recover a block succeeds


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: hdfs-client
    • Labels: None

    Description

      The client gets stuck in the following loop if an RPC it issued to recover a block timed out:

      DataStreamer#run
      1.  processDatanodeError
      2.     DN#recoverBlock
      3.        DN#syncBlock
      4.           NN#nextGenerationStamp
      5.  sleep 1s
      6.  goto 1
      

      Once we've timed out once at step 2 and looped, step 2 throws an IOE because the block is already being recovered, and step 4 throws an IOE because the block's GS is now out of date (the previous, timed-out request got a new GS and updated the block). Eventually the client reaches max retries, considers all DNs bad, and close throws an IOE.
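
      For concreteness, here's a self-contained model of this failure mode (all class, field, and method names below are illustrative stand-ins, not the actual DN/NN code):

      import java.io.IOException;
      import java.util.HashSet;
      import java.util.Set;

      // Illustrative model of the stuck loop; names are hypothetical stand-ins.
      public class StuckRecoveryModel {
          static Set<Long> blocksUnderRecovery = new HashSet<>();
          static long storedGS = 1;

          // Stand-in for the DN#recoverBlock -> NN#nextGenerationStamp path.
          static void recoverBlock(long blockId, long clientGS, boolean timeOut)
                  throws IOException {
              if (blocksUnderRecovery.contains(blockId)) {
                  throw new IOException("block is already being recovered"); // step 2 IOE
              }
              if (clientGS < storedGS) {
                  throw new IOException("generation stamp out of date");     // step 4 IOE
              }
              blocksUnderRecovery.add(blockId);
              storedGS++;                                 // recovery bumps the GS server-side
              if (timeOut) {
                  throw new IOException("RPC timed out"); // client never learns the new GS
              }
              blocksUnderRecovery.remove(blockId);
          }

          public static void main(String[] args) {
              long clientGS = 1; // the client's (soon stale) view of the GS
              for (int retry = 0; retry < 4; retry++) {
                  if (retry == 2) {
                      // The original, timed-out recovery eventually completes
                      // server-side, so later retries fail the GS check instead.
                      blocksUnderRecovery.remove(42L);
                  }
                  try {
                      recoverBlock(42L, clientGS, retry == 0);
                      System.out.println("recovered");
                      return;
                  } catch (IOException e) {
                      System.out.println("retry " + retry + ": " + e.getMessage());
                  }
              }
              System.out.println("max retries: all DNs considered bad, close() throws");
          }
      }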

      The client should be able to succeed if one of its requests to recover the block succeeded. It should still fail if another client (e.g. HBase via recoverLease, or the NN via releaseLease) successfully recovered the block. One way to handle this would be to not time out the request to recover the block. Another would be to make a subsequent call to recoverBlock succeed, e.g. by updating the block's generation stamp to the latest value that was set by the same client in the previous request (i.e. a client can recover over itself but not over another client). A rough sketch of this second option appears below.
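
      A hedged sketch of that second option, assuming the NN tracked which client last bumped the block's GS (again, names and structure here are hypothetical, not a patch against the real recovery path):

      import java.io.IOException;

      // Hypothetical sketch: a client may recover over its own earlier
      // (timed-out) recovery, but not over another client's.
      public class IdempotentRecoverySketch {
          static long storedGS = 2;                     // bumped by the timed-out request
          static String lastRecoveryClient = "DFSClient_A";

          static long nextGenerationStamp(long clientGS, String client) throws IOException {
              if (clientGS < storedGS && !client.equals(lastRecoveryClient)) {
                  // A different client (e.g. HBase via recoverLease) recovered
                  // the block: fail as before.
                  throw new IOException("generation stamp out of date");
              }
              // Same client retrying its own timed-out recovery: allow it and
              // hand out a fresh generation stamp.
              lastRecoveryClient = client;
              return ++storedGS;
          }

          public static void main(String[] args) throws IOException {
              // The retrying client still holds the stale GS (1) from before its timeout.
              System.out.println("same client retries, new GS = "
                  + nextGenerationStamp(1, "DFSClient_A"));
              try {
                  nextGenerationStamp(1, "DFSClient_B"); // another client: still fails
              } catch (IOException e) {
                  System.out.println("other client rejected: " + e.getMessage());
              }
          }
      }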


          People

            Assignee: Unassigned
            Reporter: Eli Collins
            Votes: 0
            Watchers: 4
