Details

    • Hadoop Flags:
      Reviewed

      Description

      After HDFS-5585 and HDFS-5583, clients and datanodes can coordinate shutdown-restart in order to minimize failures or locality loss.

      In this jira, the HDFS client is made aware of the restart OOB ack and performs a special write pipeline recovery. The datanode is also modified to load marked RBW replicas as RBW instead of RWR, as long as the restart did not take too long.

      The client considers this kind of recovery only when there is a single node left in the pipeline or the restarting node is a local datanode. For both clients and datanodes, the timeout or expiration is configurable, so the feature can be turned off by setting the timeout variables to 0.
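      A rough illustration of the configuration knobs, not taken from the patch itself: dfs.client.datanode-restart.timeout is the client-side key referenced later in this thread, while the datanode-side key name and the exact units below are assumptions.

        import org.apache.hadoop.conf.Configuration;

        public class RestartRecoveryConfigSketch {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Client-side: how long the writer waits for a restarting datanode
            // (30 seconds by default, per the review discussion below).
            conf.setLong("dfs.client.datanode-restart.timeout", 30L);
            // Setting the timeout to 0 effectively disables the special recovery.
            // conf.setLong("dfs.client.datanode-restart.timeout", 0L);
            // Datanode-side: how long a marked RBW replica stays valid across the
            // restart. This key name is an assumption, not confirmed by the patch.
            conf.setLong("dfs.datanode.restart.replica.expiration", 50L);
          }
        }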

      Attachments

      1. HDFS-5924_rebased_with_comments.patch
        32 kB
        Kihwal Lee
      2. HDFS-5924_RBW_RECOVERY.patch
        32 kB
        Kihwal Lee
      3. HDFS-5924_RBW_RECOVERY.patch
        32 kB
        Kihwal Lee

          Activity

          Kihwal Lee added a comment -

          The attached patch includes both client- and datanode-side changes. The new/modified test cases demonstrate the recovery and the fallback to regular pipeline recovery when the restart does not happen in time.

          This feature does not guarantee that all client writes continue across a restart. The OOB ack is advisory and may not get delivered if the network is congested or the client/server is extremely slow. Rather than blocking on delivery, the OOB transmission and shutdown time out in order to bound the upgrade latency.

          The datanode sends a restart OOB ack only when the write originated from a client, not from another datanode. The replicas are recorded in individual restart meta files along with a timestamp indicating expiry. If the datanode restart takes longer than that, the RBW replicas will be loaded normally as RWR for namenode-driven block recovery. Otherwise the marked RBW replicas will be loaded as RBW.
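          A minimal sketch of that load-time decision, assuming the restart meta file simply holds an expiry timestamp in milliseconds; the actual on-disk format and the helper name loadAsRbw are illustrative, not from the patch.

            import java.io.BufferedReader;
            import java.io.File;
            import java.io.FileReader;
            import java.io.IOException;

            public class RestartMetaSketch {
              /** Returns true if the marked replica should be loaded as RBW, false for RWR. */
              static boolean loadAsRbw(File restartMeta) {
                if (!restartMeta.exists()) {
                  return false;                    // no marker: load as RWR as before
                }
                // try-with-resources closes the restart meta input stream.
                try (BufferedReader in = new BufferedReader(new FileReader(restartMeta))) {
                  String line = in.readLine();
                  if (line == null) {
                    return false;                  // empty marker: fall back to RWR
                  }
                  long expiry = Long.parseLong(line.trim());
                  // Restart completed before the recorded expiry: keep the replica as RBW.
                  return System.currentTimeMillis() < expiry;
                } catch (IOException | NumberFormatException e) {
                  return false;                    // unreadable marker: fall back to RWR
                }
              }
            }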

          Because of the controlled shutdown and restart and the way pipeline recovery works, I believe this change has no adverse effect on append recovery.

          Kihwal Lee added a comment -

          The patch applies on top of HDFS-5585 and HDFS-5583.

          Kihwal Lee added a comment -

          The previous patch did not close the restart meta input stream. Attaching the revised patch.

          Brandon Li added a comment -

          This feature does not guarantee that all client writes continue across a restart.

          Would it cause data loss, especially when the only remaining datanode, or more than one datanode, in the pipeline is shutting down for upgrade?

          Kihwal Lee added a comment - - edited

          If the delivery of the OOB ack was not successful due to a network or hardware issue and there was only one replica in the pipeline, the write will fail. This is no worse than the current behavior. Data loss typically refers to situations where data was successfully written but part or all of it becomes permanently unavailable. Here, it is different: the write simply fails.

          In short, OOB acking is used for a smoother upgrade process, but (1) this feature won't block shutdown indefinitely, and (2) if an OOB ack is not delivered, things fall back to the existing non-upgrade behavior.

          Brandon Li added a comment -

          This is no worse than the current behavior.

          The application could experience a much higher write failure rate during datanode upgrades. For example, it is now more likely that all datanodes in the pipeline become inaccessible at once. Shutting down a datanode only after the previously shut-down datanode is back up might minimize write failures, but it would increase the total upgrade time.

          I recall there was some discussion that, after a datanode sends the OOB ack, its shutdown could be paused until the client agrees to let it go. I don't remember why that approach was not taken.

          Brandon Li added a comment -

          With some more thought, I feel pausing the datanode shutdown might not be a generally better approach, since it can lead to unbounded upgrade time along with more complicated communication between client and datanode (e.g., it may require repeated OOB acks at different pipeline phases).

          Read failures can still happen regardless of how writes are handled during datanode upgrades. Selectively shutting down datanodes for upgrade (e.g., randomly, one by one, instead of rack by rack) might help reduce I/O failures.

          Kihwal Lee added a comment -

          Batch upgrades and upgrade speed

          Shutting down a datanode only after the previously shut-down datanode is back up might minimize write failures, but it would increase the total upgrade time.

          The upgrade batch size has come up multiple times in past discussions. The conclusion is that DN rolling upgrades shouldn't be done in batches, in order to minimize data loss. When upgrading a big cluster the traditional way, a number of DNs don't come back online, mainly due to latent hardware issues. This results in missing blocks, and manual recovery is necessary. Most cases are recoverable by fixing the hardware and manually copying block files, but not all; sometimes a number of blocks are permanently lost. Since the block placement policy is not sophisticated enough to consider failure domains, almost any set of several simultaneous permanent node failures can cause this.

          During DN rolling upgrades, admins should watch out for data availability issues caused by failed restarts. Data loss from permanent failures can only be avoided if just one or two nodes are upgraded at a time. Once the normal failure rate of non-upgrading nodes is factored in, one soon realizes that upgrading one node at a time is the preferred way.

          The upgrade timing requirement (req-3) in the design doc was specified based on the serial DN upgrade scenario for this reason.

          Regarding upgrade speed, it largely depends on the number of blocks on each DN. The restart of a DN with about half a million blocks in 4-6 volumes used to take minutes. After a number of optimizations and HDFS-5498, it can come back up in about 10 seconds.

          Write pipeline recovery during DN upgrades

          There are three things a client can do when a node in the pipeline is being upgraded/restarted:

          1. Exclude the node and continue: this is just like the regular pipeline recovery. If the pipeline has enough nodes, clients can do this. Even if there is no OOB ack, the majority of writers will survive this way as long as they don't hit double/triple failures combined with upgrades. But single-replica writes will fail.
          2. Copy the block from the upgrading node to a new one and continue: this is a variation of the "add additional node" recovery path of the pipeline recovery. The client will ask the NN for an additional node and then issue a copy command with the upgrading node as the source. This does not add any value unless there is only one node in the pipeline. Writes with only one replica will fail if the copy cannot be carried out.
          3. Wait until the node comes back: this is an alternative to the above approach. It avoids extra data movement and maintains locality. I've talked to the HBase folks and they said they would prefer this. Writes with only one replica will fail if the restart does not happen in time.

          If min_replica is set to 2 for a cluster, (1) may take care of the service availability side of the issue; i.e., all writes succeed if nodes are upgraded one by one. There are use cases, however, that require min_replica to be 1, so we need a way to make single-replica writes survive upgrades.

          A single-replica write may be a write that was initiated with one replica from the beginning, or it may be the result of node failures during the write. The former is usually considered okay to fail, since users understand the risk of a replication factor of 1 and make that choice. The more problematic case is the latter: writes started with more than one replica but currently down to only one due to node failures. I believe (2) or (3) can solve most of these cases, but neither is ideal.

          In theory, (2) sounds useful, but there are some difficulties in realizing it. First, a DN has to stop accepting new requests, yet still allow copy requests. This means DataXceiverServer's server socket cannot be closed until the copy requests are received. Since the knowledge about a specific client (writer) is only held by its DataXceiver thread, and the copy command will spawn another thread on the DN, coordinating this is not simple. Secondly, the DN should be able to detect when it is safe to shut down. To get this right, it has to be told by the clients, and a slow client can slow down the whole process. In the end, the extra coordination and dependencies are not exactly cheap.

          The mechanism in (3) works fine with no additional run-time traffic overhead and no dependencies on clients. Timing-wise it is also more deterministic. The downside is that if the datanode does not come back up in time and it was the only node left in the pipeline, outstanding writes will time out and fail. In addition to letting writes continue, this mechanism preserves locality, so (2) cannot substitute for (3) completely.
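          To make the timing concrete, here is a hedged sketch of the client-side wait in (3), using the 30-second deadline and 4-second retry interval discussed later in this thread; the method names (waitForDatanodeRestart, isDatanodeBackUp) are illustrative, not the actual DFSOutputStream code.

            public class RestartWaitSketch {
              /** Waits up to timeoutMs for the restarting datanode; true if it came back. */
              static boolean waitForDatanodeRestart(long timeoutMs) throws InterruptedException {
                final long deadline = System.currentTimeMillis() + timeoutMs;   // e.g. 30000 ms
                while (System.currentTimeMillis() < deadline) {
                  if (isDatanodeBackUp()) {
                    return true;                 // resume the pipeline on the same datanode
                  }
                  long remaining = deadline - System.currentTimeMillis();
                  // Retry interval is capped at 4 seconds; the overall deadline stays 30 s.
                  Thread.sleep(Math.min(4000L, Math.max(0L, remaining)));
                }
                return false;                    // fall back to regular pipeline recovery
              }

              /** Placeholder probe; the real client re-attempts a connection to the datanode. */
              private static boolean isDatanodeBackUp() {
                return false;
              }
            }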

          I suggest an additional change in order to keep the mechanism simple while addressing its deficiency.

          Suggested improvement

          While using (3) for write pipeline recovery during upgrade restarts, we can reduce the chance of permanent failures for writers whose replica count has been reduced to one by prior node failures. If the user started a block write with no replication (i.e., a single replica), it is assumed that the user understands the higher risk of read or write failures. Thus we focus on the cases where the specified replication factor is greater than one.

          The datanode replacement policy is defined in ReplaceDatanodeOnFailure. If the default policy is modified to maintain at least two replicas for r > 1, the chance of write failures for r > 1 during DN rolling upgrades will be greatly reduced. Note that this does not incur additional replication traffic.
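          A hedged sketch of the proposed condition, not the actual ReplaceDatanodeOnFailure code (the real policy also considers other factors, such as hflush/append state):

            public class ReplacePolicySketch {
              /**
               * @param replication requested replication factor (r)
               * @param existing    number of datanodes still alive in the pipeline
               * @return whether the client should ask the NN for a replacement datanode
               */
              static boolean shouldReplace(int replication, int existing) {
                if (replication <= 1) {
                  return false;       // single-replica writers accepted the risk up front
                }
                return existing < 2;  // keep at least two replicas in the pipeline for r > 1
              }
            }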

          If you think this proposal makes sense, I will file a separate jira to update ReplaceDatanodeOnFailure.

          Vinayakumar B added a comment -

          I think approach (3) is better; in any case, fallback to normal pipeline recovery will still happen once the timeout occurs.

          Looking at the attached patch, waiting happens only if the local datanode is being restarted for upgrade. Is that enough? Is there no need to wait if another node in the pipeline is being upgraded?

          Kihwal Lee added a comment -

          Looking at the attached patch, waiting happens only if the local datanode is being restarted for upgrade. Is that enough? Is there no need to wait if another node in the pipeline is being upgraded?

          It will wait for a non-local datanode if that node is the only one left in the write pipeline. Otherwise, clients simply apply the normal pipeline recovery procedure, which takes less time. The abandoned replica will get block-reported when the datanode restarts. If the client already did pipeline recovery, the NN can mark the replica as corrupt right away, since the generation stamp wouldn't match. If the report comes in before the pipeline recovery, the replica will be added to the replica list of the BlockInfoUnderConstruction, if not already there; when the pipeline recovery is performed by the client, the outdated replica will get removed.
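          A hedged, illustrative sketch of that namenode-side decision (not actual BlockManager code), keyed off the generation stamp comparison described above:

            public class ReportedReplicaSketch {
              enum Action { MARK_CORRUPT, ADD_TO_UC_REPLICA_LIST, ACCEPT }

              /**
               * @param reportedGs        generation stamp of the replica in the block report
               * @param currentGs         generation stamp the NN currently expects for the block
               * @param underConstruction whether the block is still under construction
               */
              static Action onReportedReplica(long reportedGs, long currentGs,
                                              boolean underConstruction) {
                if (reportedGs < currentGs) {
                  // The client already ran pipeline recovery and bumped the generation
                  // stamp, so the restarted datanode's replica is stale.
                  return Action.MARK_CORRUPT;
                }
                if (underConstruction) {
                  // Report arrived before any pipeline recovery: remember the replica.
                  return Action.ADD_TO_UC_REPLICA_LIST;
                }
                return Action.ACCEPT;
              }
            }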

          Brandon Li added a comment -

          The suggested improvement sounds good to me. A couple more comments for the patch:

          • By default, dfs.client.datanode-restart.timeout is 30 seconds. However, 4 seconds is hardcoded in the code as the maximum delay. The 30 seconds here may confuse users.
          • Should DFS_DATANODE_RESTART_REPLICA_EXPIRY_DEFAULT be less than min(DFS_CLIENT_DATANODE_RESTART_TIMEOUT_DEFAULT, 4 seconds)? If the datanode takes too long to start up, it loses the chance to be included in the original pipeline.
          Vinayakumar B added a comment -

          By default, dfs.client.datanode-restart.timeout is 30 seconds. However, 4 seconds is hardcoded in the code as the maximum delay. The 30 seconds here may confuse users.

          I think you misread it. The max 4 seconds is the check interval for the datanode coming back up; the actual deadline (30 seconds) is calculated when the OOB response is received, right? Am I missing anything here?

          Kihwal Lee added a comment -

          As Vinay observed, 4 seconds is the retry interval. I will try to improve the comments to make this clearer.

          Kihwal Lee added a comment -

          ... I will file a separate jira to update ReplaceDatanodeOnFailure.

          HDFS-6016 has been filed.

          Brandon Li added a comment -

          Thank you, Kihwal Lee and [~vinayakumarb], for clarifying it.
          +1, the patch looks good to me.

          Kihwal Lee added a comment -

          The new patch adds more comments to clarify that the 4-second timeout is the retry interval. It also rebases the patch onto the current branch; there were contextual differences at the beginning of DFSClient.java.

          Kihwal Lee added a comment -

          Thanks for the reviews, Vinay and Brandon. I've committed this to the RU branch.


            People

            • Assignee:
              Kihwal Lee
            • Reporter:
              Kihwal Lee
            • Votes: 0
            • Watchers: 6
