Batch upgrades and upgrade speed
Shutting down a datanode only after the previously shut-down datanode is back up can minimize write failures, but it increases the total upgrade time.
The upgrade batch size has come up multiple times in past discussions. The conclusion is that DN rolling upgrades shouldn't be done in batches, in order to minimize data loss. When upgrading a big cluster in the traditional way, a number of DNs don't come back online, mainly due to latent hardware issues. This results in missing blocks, and manual recovery is necessary. Most cases are recoverable by fixing hardware and manually copying block files, but not all; sometimes a number of blocks are permanently lost. Since the block placement policy is not sophisticated enough to consider failure domains, even a few simultaneous permanent node failures can cause this.
During DN rolling upgrades, admins should watch out for data availability issues caused by failed restarts. Data loss from permanent failures can only be avoided if just one or two nodes are upgraded at a time. If one factors in the normal failure rate of the non-upgrading nodes, one soon realizes that upgrading one node at a time is the preferred way.
The upgrade timing requirement (req-3) in the design doc was specified based on the serial DN upgrade scenario for this reason.
Regarding the upgrade speed, it largely depends on the number of blocks on each DN. The restart of a DN with about half a million blocks in 4-6 volumes used to take minutes. After a number of optimizations and HDFS-5498, it can come back up in about 10 seconds.
Write pipeline recovery during DN upgrades
There are three things a client can do when a node in the pipeline is being upgraded/restarted:
- Exclude the node and continue: this is just like the regular pipeline recovery. If the pipeline has enough nodes, clients can do this. Even without an OOB ack, the majority of writers will survive this way as long as they don't hit double/triple failures combined with upgrades. But single-replica writes will fail.
- Copy the block from the upgrading node to a new one and continue: this is a variation of the "add additional node" path of the regular pipeline recovery. The client asks the NN for an additional node and then issues a copy command with the upgrading node as the source. This adds no value unless there is only one node in the pipeline. Writes with only one replica will fail if the copy cannot be carried out.
- Wait until the node comes back: this is an alternative to the above approach. It avoids extra data movement and maintains locality. I've talked to the HBase folks, and they said they would prefer this. Writes with only one replica will fail if the restart does not happen in time.
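The choice among the three options can be boiled down to a small decision rule. The sketch below is purely illustrative; the function and parameter names are hypothetical and do not correspond to actual HDFS client internals.

```python
# Illustrative sketch only: names here are hypothetical, not HDFS code.

def choose_recovery(live_nodes, can_wait_for_restart):
    """Pick among the three options for a write pipeline in which one
    node has announced it is restarting for an upgrade.

    live_nodes: datanodes still usable in the pipeline, excluding the
    restarting one.
    """
    if len(live_nodes) >= 1:
        # (1) Enough nodes remain: drop the restarting node and continue.
        return "exclude"
    if can_wait_for_restart:
        # (3) Sole replica: pause the write and wait for the node to
        # come back, preserving locality.
        return "wait"
    # (2) Sole replica and waiting is not an option: ask the NN for a
    # new node and copy the partial block from the upgrading node.
    return "copy"
```

Under this rule, options (2) and (3) only matter when the upgrading node is the last one left in the pipeline, which matches the discussion above.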
If min_replica is set to 2 for a cluster, (1) may take care of the service-availability side of the issue; i.e., all writes succeed if nodes are upgraded one by one. There are use cases, however, that require min_replica to be 1, so we need a way to make single-replica writes survive upgrades.
A single-replica write may be one that was initiated with one replica from the beginning, or may be the result of node failures during the write. The former is usually considered okay to fail, since users understand the risk of a replication factor of 1 and make that choice. The more problematic case is the latter: writes started with more than one replica but currently down to one due to node failures. I believe (2) or (3) can solve most of these cases, but neither is ideal.
In theory, (2) sounds useful, but there are some difficulties in realizing it. First, a DN has to stop accepting new requests yet still accept copy requests. This means DataXceiverServer's server socket cannot be closed until the copy requests have been received. Since the knowledge about a specific client (writer) is known only to its DataXceiver thread, and the copy command will spawn another thread on the DN, coordinating this is not simple. Secondly, the DN should be able to detect when it is safe to shut down; to get this right, it has to be told by the clients. A slow client can also slow down the progress. In the end, the extra coordination and dependencies are not exactly cheap.
The mechanism in (3) works fine with no additional run-time traffic overhead and no dependencies on clients. Timing-wise it is also more deterministic. The downside is that if the datanode does not come back up in time, outstanding writes will time out and fail if the node was the only one left in the pipeline. In addition to letting writes continue, this mechanism preserves locality, so (2) cannot substitute for (3) completely.
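The waiting behavior in (3) is essentially a bounded poll: the writer pauses instead of rebuilding the pipeline, and the write fails only if the node misses its deadline. A minimal sketch, with illustrative names and a polling scheme that stands in for the real client's internal wait logic:

```python
import time

def wait_for_restart(is_alive, deadline_secs, poll_secs=1.0):
    """Sketch of approach (3): poll until the restarting datanode is
    reachable again, or give up at the deadline.

    is_alive: callable returning True once the DN is back up
    (hypothetical probe; the real client has its own liveness check).
    """
    deadline = time.monotonic() + deadline_secs
    while True:
        if is_alive():
            return True   # node is back; the write can resume
        if time.monotonic() >= deadline:
            return False  # restart took too long; the write fails
        time.sleep(poll_secs)
```

The deadline makes the failure mode deterministic: given the ~10-second restart figure above, a deadline comfortably above that bounds how long a writer can stall.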
I suggest an additional change that keeps the mechanism simple while addressing its deficiency.
While using (3) for write pipeline recovery during upgrade restarts, we can reduce the chance of permanent failures for writers whose replica count has been reduced to one by prior node failures. If a block write was started with no replication (i.e., a single replica) by the user, it is assumed that the user understands the higher risk of read or write failures. Thus we focus on the cases where the specified replication factor is greater than one.
The datanode replacement policy is defined in ReplaceDatanodeOnFailure. If the default policy is modified to maintain at least two replicas whenever r > 1, the chance of write failures for r > 1 during DN rolling upgrades will be greatly reduced. Note that this does not incur additional replication traffic.
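The proposed tweak reduces to one predicate. The sketch below is hypothetical and illustrative only, not the actual ReplaceDatanodeOnFailure code in Hadoop; it just captures the proposed rule:

```python
# Hypothetical sketch of the proposed default-policy change for
# ReplaceDatanodeOnFailure (illustrative only, not the Hadoop source).

def should_add_replacement(replication, live_replicas):
    """For writes with replication > 1, request a replacement node from
    the NN whenever fewer than two replicas remain in the pipeline, so
    a later upgrade-restart can never leave the writer with zero live
    replicas."""
    if replication <= 1:
        # The user chose a single replica and accepted the risk.
        return False
    return live_replicas < 2
```

With this rule, a pipeline for r > 1 is replenished as soon as it drops to one node, so option (3)'s only failure mode (the sole remaining node missing its restart deadline) is largely avoided.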
If you think this proposal makes sense, I will file a separate jira to update ReplaceDatanodeOnFailure.