Andrew Wang : Can you comment on how riding out DN restarts interacts with the HBase MTTR work? I know they've done a lot of work to reduce timeouts throughout the stack, and riding out restarts sounds like we need to keep the timeouts up. It might help to specify your target restart time, for example with a DN with 500k blocks.
There are multiple factors that affect the DN upgrade latency. First, shutdown can take up to about 2 seconds. The time is bounded so that slow clients or networks cannot delay the shutdown. That is, the shutdown notification (via the OOB acks) is advisory and may not reach all clients. HDFS does not provide an integrated tool for software/config upgrade & restart; the process will differ depending on how hadoop and its config are deployed. Whatever the method, it will add a bit of delay to the whole process.
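For the shutdown step itself, the rolling-upgrade work exposes dfsadmin commands that trigger the OOB-notified shutdown and let an operator poll for the DN coming back. The host:port below is a placeholder for a DN's IPC address:

```shell
# Ask the DN to shut down for upgrade; clients on open pipelines
# receive the advisory OOB ack mentioned above.
hdfs dfsadmin -shutdownDatanode dn1.example.com:50020 upgrade

# After replacing the software/config, poll until the DN's IPC
# server is reachable again.
hdfs dfsadmin -getDatanodeInfo dn1.example.com:50020
```

The actual restart of the daemon is left to whatever deployment tooling is in use, which is the deployment-dependent delay noted above.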
The actual DN startup time varies with the number of blocks and the number of disks (typically one volume per disk). Before HDFS-5498, the DN would run "du" on each volume serially, and would also list and stat block and meta files serially. Now "du" does not have to run if the DN is restarted within 5 minutes of shutdown, and block/meta scans are done concurrently when multiple volumes are present. So the startup time will be shorter if blocks are spread across more disks. In a test carried out at the beginning of HDFS-5498, the restart time of a node shrank from over a minute to about 12 seconds. We could make restart even shorter by saving more state on shutdown.
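The concurrent volume scan can be sketched as below. This is only a minimal illustration of the idea, not the actual DataNode code; the class and method names are made up, and regular files stand in for block/meta files:

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.Stream;

// Illustrative sketch: scan each volume (directory) on its own thread,
// the way the DN scans volumes concurrently at startup after HDFS-5498.
public class ConcurrentVolumeScan {
    // Counts regular files (stand-ins for block/meta files) under each
    // volume root in parallel and returns the grand total.
    public static long scanVolumes(List<Path> volumes) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(volumes.size());
        try {
            List<Future<Long>> perVolume = new ArrayList<>();
            for (Path vol : volumes) {
                perVolume.add(pool.submit(() -> {
                    try (Stream<Path> files = Files.walk(vol)) {
                        return files.filter(Files::isRegularFile).count();
                    }
                }));
            }
            long total = 0;
            for (Future<Long> f : perVolume) {
                total += f.get();  // wait for every volume's scan
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Since each volume is on a separate disk, the scans do not contend for the same spindle, which is why spreading blocks across more disks shortens startup.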
The final step before being able to serve clients is registration. If block tokens are in use, the DN cannot validate them until it has registered with the NN. This usually happens immediately after scanning and adding blocks. We could save the shared secret before shutdown, but that would obviously be a security risk.
I will be more than happy to further improve the DN restart latency if it becomes a major hurdle for HBase.
stack : + "3. Clients" It says "For shutdown, clients may choose to copy partially written replica to another node..." The DFSClient would do this internally (or 'may' do this)?
This is done internally. It is what happens today, for clients writing with replication >= 3, when node failures leave only (#replicas / 2) nodes in the pipeline. If the restarting node is not the local node and is not the only node in the pipeline, the client will simply exclude the restarting node, add more nodes to the pipeline if necessary, and copy the partially written data over to the new node. If the restarting node is the local node or is the only node in the pipeline, the client will wait for the node to restart. There is a configurable timeout for the wait.
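The wait-or-exclude rule described above can be summarized in a small helper. This is a hypothetical illustration of the decision, not actual DFSClient code:

```java
// Hypothetical helper capturing the rule above: the client waits for the
// restarting DN only when it is the local node or the sole node left in
// the write pipeline; otherwise the DN is excluded and replaced.
public class RestartPolicy {
    public static boolean shouldWaitForRestart(boolean restartingNodeIsLocal,
                                               int nodesInPipeline) {
        return restartingNodeIsLocal || nodesInPipeline == 1;
    }
}
```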
If the restarting node was the only node in the pipeline, the write will fail after the restart timeout. This is very unlikely for blocks with 3 or more replicas, due to the DN replacement policy. Blocks being written with a replication factor of 2 can, however, suffer if the writer has already lost one node to a failure and then loses another to a failed restart. HDFS-6016 has been filed to address this issue, but it is a non-blocker for the rolling upgrade feature.
Regarding the use of a shortened read/write socket timeout, I can see use cases favoring failing over to a slower node rather than waiting longer for the same fail-over, or for a recovery if lucky. Reads will fail over to a different source during a DN restart. For writes, the restart timeout is independent of the socket write timeout, so the client may block for longer than the user wants. If that is undesirable, the configuration should be changed so that DFSOutputStream times out on restart much more quickly. The default is currently 30 seconds.
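For reference, the write-path wait is governed by the client-side key below (the 30-second figure above is its default), shown here as an hdfs-site.xml fragment; the 5-second value is only an example for a latency-sensitive client:

```xml
<property>
  <name>dfs.client.datanode-restart.timeout</name>
  <!-- Seconds a writer waits for a restarting DN before giving up on it;
       lower this if blocking writes are worse than failing over. -->
  <value>5</value>
</property>
```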