Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Add an ability for data nodes to send an OOB response in order to indicate an upcoming upgrade-restart. Client should ignore the pipeline error from the node for a configured amount of time and try reconstruct the pipeline without excluding the restarted node. If the node does not come back in time, regular pipeline recovery should happen.

      This feature is useful for the applications with a need to keep blocks local. If the upgrade-restart is fast, the wait is preferable to losing locality. It could also be used in general instead of the draining-writer strategy.

      1. HDFS-5583.patch
        29 kB
        Kihwal Lee
      2. HDFS-5583.patch
        31 kB
        Kihwal Lee
      3. HDFS-5583.patch
        31 kB
        Kihwal Lee
      4. HDFS-5583.patch
        31 kB
        Kihwal Lee

        Issue Links

          Activity

          Hide
          Tsz Wo Nicholas Sze added a comment -

          Kihwal, thanks for the fast responses!

          Show
          Tsz Wo Nicholas Sze added a comment - Kihwal, thanks for the fast responses!
          Hide
          Kihwal Lee added a comment -

          HDFS-6015 has been filed for the test failure. Patches are available on both HDFS-6014 and HDFS-6015.

          Show
          Kihwal Lee added a comment - HDFS-6015 has been filed for the test failure. Patches are available on both HDFS-6014 and HDFS-6015 .
          Hide
          Kihwal Lee added a comment -

          HDFS-6014 has been filed for fixing the two findbug warnings.

          Show
          Kihwal Lee added a comment - HDFS-6014 has been filed for fixing the two findbug warnings.
          Show
          Tsz Wo Nicholas Sze added a comment - The failure of TestBlockRecovery in the same build may also be related this. https://builds.apache.org/job/PreCommit-HDFS-Build/6225//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockRecovery/testRaceBetweenReplicaRecoveryAndFinalizeBlock/
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Hi Kihwal, there are two findbugs warnings related to the patch. Could you take a look?
          https://builds.apache.org/job/PreCommit-HDFS-Build/6225/artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html

          Show
          Tsz Wo Nicholas Sze added a comment - Hi Kihwal, there are two findbugs warnings related to the patch. Could you take a look? https://builds.apache.org/job/PreCommit-HDFS-Build/6225/artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Hide
          Kihwal Lee added a comment -

          Thanks, Brandon. I've committed this to the RU branch.

          Show
          Kihwal Lee added a comment - Thanks, Brandon. I've committed this to the RU branch.
          Hide
          Brandon Li added a comment -

          +1. The patch looks good to me.

          Show
          Brandon Li added a comment - +1. The patch looks good to me.
          Hide
          Kihwal Lee added a comment -

          The new patch changes the variable name from "restarting" to "shutdownForUpgrade". An unused import has been removed from DataXceiverServer.java.

          Show
          Kihwal Lee added a comment - The new patch changes the variable name from "restarting" to "shutdownForUpgrade". An unused import has been removed from DataXceiverServer.java.
          Hide
          Brandon Li added a comment -

          Kihwal Lee, the rest of the patch looks good.

          Show
          Brandon Li added a comment - Kihwal Lee , the rest of the patch looks good.
          Hide
          Kihwal Lee added a comment - - edited

          Thanks for the review, Brandon.

          • The admin wants to know whether the command was received by the datanode: This is determined by the return code of the command. As with other commands, when the return code is not 0, the state is non-deterministic and only then the command may be reissued. I do not believe that this is a common case. Moreover, the shutdown normally takes less than two seconds and probably the reissuing shutdown manually takes more than that. In my opinion, adding support for reporting progress won't have much value. If you still feel that it needs to be changed, I will change it. Please let me know what you think.
          • I am planning on adding at least one more OOB ack type in near future for write draining, which will be useful for decommissioining. The reserved enums make certain checks more efficient.

          I will address the rest of the comments when you finish the review.

          Show
          Kihwal Lee added a comment - - edited Thanks for the review, Brandon. The admin wants to know whether the command was received by the datanode: This is determined by the return code of the command. As with other commands, when the return code is not 0, the state is non-deterministic and only then the command may be reissued. I do not believe that this is a common case. Moreover, the shutdown normally takes less than two seconds and probably the reissuing shutdown manually takes more than that. In my opinion, adding support for reporting progress won't have much value. If you still feel that it needs to be changed, I will change it. Please let me know what you think. I am planning on adding at least one more OOB ack type in near future for write draining, which will be useful for decommissioining. The reserved enums make certain checks more efficient. I will address the rest of the comments when you finish the review.
          Hide
          Brandon Li added a comment -

          Some early comments. I haven't finish viewing all the changes.

          • In DataNode#shutdownDatanode() can be called only once, and throws exception for the next invocations.
            I would imagine that after administrator issues "dfsadmin shutdownDatanode -upgrade"command, he/she would like to know if the DataNodes received it and if they are in upgrade preparation state. Unless I missed something, it seems the only way to know it is to issue the same command again and expect to receive an exception. Would it be better to either let shutdownDatanode return an error code or have getDataNodeInfo include current datanode state?
          • Do we plan to have more OOB Ack anytime soon? We can always add new enums instead of reserving a few OOB_RESERVEDx for now.
          • In DataNode.java: is "forUpgrade", "upgrade" or "shutdownForUpgrade" a better name than the variable name "restarting"?
          • DataXceiverServer.java: please clean the unused import
          Show
          Brandon Li added a comment - Some early comments. I haven't finish viewing all the changes. In DataNode#shutdownDatanode() can be called only once, and throws exception for the next invocations. I would imagine that after administrator issues "dfsadmin shutdownDatanode -upgrade"command, he/she would like to know if the DataNodes received it and if they are in upgrade preparation state. Unless I missed something, it seems the only way to know it is to issue the same command again and expect to receive an exception. Would it be better to either let shutdownDatanode return an error code or have getDataNodeInfo include current datanode state? Do we plan to have more OOB Ack anytime soon? We can always add new enums instead of reserving a few OOB_RESERVEDx for now. In DataNode.java: is "forUpgrade", "upgrade" or "shutdownForUpgrade" a better name than the variable name "restarting"? DataXceiverServer.java: please clean the unused import
          Hide
          Brandon Li added a comment -

          Sure. I will review it.

          Show
          Brandon Li added a comment - Sure. I will review it.
          Hide
          Kihwal Lee added a comment -

          Brandon Li, Would you review this and HDFS-5924?

          Show
          Kihwal Lee added a comment - Brandon Li , Would you review this and HDFS-5924 ?
          Hide
          Kihwal Lee added a comment -

          I left one debugging log line left in the code in the patch. The new patch removes it. Otherwise there is no difference.

          Show
          Kihwal Lee added a comment - I left one debugging log line left in the code in the patch. The new patch removes it. Otherwise there is no difference.
          Hide
          Kihwal Lee added a comment -

          This patch triggers sending of the restart OOB ack to clients who are currently writing data.

          The shutdown ordering and timing have been adjusted to give enough time for DataXceiver threads (serving writes) to send the restart OOB ack upstream. First, DataXceiverServer is interrupted and in turn each DataXceiver threads are interrupted by it after closing the server socket to prevent further client connections. Idling DataXceiver threads due to keepalive will simply terminate.

          If DataNode#restarting is set, the OOB ack will be directly sent by these threads before taking down the packet responder threads. If the packet responder is in the middle of sending an ack, it can be blocked up to a configured amount of time before failing, which is 1.5 seconds by default. If they started sending but send takes a long time (e.g. slow client, network issue, etc.), they will get interrupted by DataXceiverServer in 2 seconds. DataXceiverServer will tear down sooner if all DataXceiver threads finish less than 2 seconds.

          The IPC server is stopped later in order to minimize the chance of shutdownDatanode() response being dropped. The shutdown method will only start interrupting the thread pool after a few seconds have passed since the DataXceiverServer interruption. By this time, all threads must have stopped, but if anyone didn't, they will get interrupted repeatedly. This is an existing behavior.

          The main DataNode thread joins on BP service threads. There was a fixed 2 second sleep, which has been changed to only wait until the shutdown is done. If the BP service threads terminated but shutdown() was not called, main thread will delay the exit for 2 seconds as it did before.

          This patch does not include the client-side changes, so the OOB ack will not have any visible effects. It will be treated as a node failure, which also happens when a datanode shuts down.

          Show
          Kihwal Lee added a comment - This patch triggers sending of the restart OOB ack to clients who are currently writing data. The shutdown ordering and timing have been adjusted to give enough time for DataXceiver threads (serving writes) to send the restart OOB ack upstream. First, DataXceiverServer is interrupted and in turn each DataXceiver threads are interrupted by it after closing the server socket to prevent further client connections. Idling DataXceiver threads due to keepalive will simply terminate. If DataNode#restarting is set, the OOB ack will be directly sent by these threads before taking down the packet responder threads. If the packet responder is in the middle of sending an ack, it can be blocked up to a configured amount of time before failing, which is 1.5 seconds by default. If they started sending but send takes a long time (e.g. slow client, network issue, etc.), they will get interrupted by DataXceiverServer in 2 seconds. DataXceiverServer will tear down sooner if all DataXceiver threads finish less than 2 seconds. The IPC server is stopped later in order to minimize the chance of shutdownDatanode() response being dropped. The shutdown method will only start interrupting the thread pool after a few seconds have passed since the DataXceiverServer interruption. By this time, all threads must have stopped, but if anyone didn't, they will get interrupted repeatedly. This is an existing behavior. The main DataNode thread joins on BP service threads. There was a fixed 2 second sleep, which has been changed to only wait until the shutdown is done. If the BP service threads terminated but shutdown() was not called, main thread will delay the exit for 2 seconds as it did before. This patch does not include the client-side changes, so the OOB ack will not have any visible effects. It will be treated as a node failure, which also happens when a datanode shuts down.
          Hide
          Kihwal Lee added a comment -

          The new patch addresses Vinay's comments.

          Enums renamed.

          + OOB_RESTART = 8; // Quick restart
          + OOB_RESERVED1 = 9; // Reserved
          + OOB_RESERVED2 = 10; // Reserved
          + OOB_RESERVED3 = 11; // Reserved

          Show
          Kihwal Lee added a comment - The new patch addresses Vinay's comments. Enums renamed. + OOB_RESTART = 8; // Quick restart + OOB_RESERVED1 = 9; // Reserved + OOB_RESERVED2 = 10; // Reserved + OOB_RESERVED3 = 11; // Reserved
          Hide
          Vinayakumar B added a comment -
          +  OOB_TYPE1 = 8;          // Quick restart
          +  OOB_TYPE2 = 9;          // Reserved
          +  OOB_TYPE3 = 10;         // Reserved
          +  OOB_TYPE4 = 11;         // Reserved
          

          I think instead of OOB_TYPE1, OOB_TYPE2 better names could be given.. any thoughts?

                if (!responderClosed) { // Abnormal termination.

          I think comment is no more holds good. May be that can be removed.

          changes done in sendAckUpstream() are not formatter correctly. Contains tab characters too.

          Java doc could be added for Status myStatus in sendAckUpstream()

          Show
          Vinayakumar B added a comment - + OOB_TYPE1 = 8; // Quick restart + OOB_TYPE2 = 9; // Reserved + OOB_TYPE3 = 10; // Reserved + OOB_TYPE4 = 11; // Reserved I think instead of OOB_TYPE1, OOB_TYPE2 better names could be given.. any thoughts? if (!responderClosed) { // Abnormal termination. I think comment is no more holds good. May be that can be removed. changes done in sendAckUpstream() are not formatter correctly. Contains tab characters too. Java doc could be added for Status myStatus in sendAckUpstream()
          Hide
          Kihwal Lee added a comment - - edited

          The patch makes DN send OOB acks to clients who are writing. The added test case currently doesn't do much, but after the client-side changes, it will be updated.

          The OOB Ack sending can still be verified from running the new test case. The test log should show something like following:

          [DataNode]
          2014-02-10 23:23:52,412 INFO datanode.DataNode (DataXceiverServer.java:run(190)) - Shutting down DataXceiverServer before restart
          2014-02-10 23:23:52,412 INFO datanode.DataNode (BlockReceiver.java:receiveBlock(731)) - Shutting down for restart (BP-203907574-10.0.1.17-1392096230619:blk_1073741825_1002).
          2014-02-10 23:23:52,413 INFO datanode.DataNode (BlockReceiver.java:sendOOBResponse(977)) - Sending an out of band ack of type OOB_TYPE1

          [Upstream Datanode]
          2014-02-10 23:23:52,413 INFO datanode.DataNode (BlockReceiver.java:run(1060)) - Relaying an out of band ack of type OOB_TYPE1

          [Client]
          2014-02-10 23:23:52,414 WARN hdfs.DFSClient (DFSOutputStream.java:run(784)) - DFSOutputStream ResponseProcessor exception for block BP-203907574-10.0.1.17-1392096230619:blk_1073741825_1002
          java.io.IOException: Bad response OOB_TYPE1 for block BP-203907574-10.0.1.17-1392096230619:blk_1073741825_1002 from datanode 127.0.0.1:55182
          at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:732)

          Show
          Kihwal Lee added a comment - - edited The patch makes DN send OOB acks to clients who are writing. The added test case currently doesn't do much, but after the client-side changes, it will be updated. The OOB Ack sending can still be verified from running the new test case. The test log should show something like following: [DataNode] 2014-02-10 23:23:52,412 INFO datanode.DataNode (DataXceiverServer.java:run(190)) - Shutting down DataXceiverServer before restart 2014-02-10 23:23:52,412 INFO datanode.DataNode (BlockReceiver.java:receiveBlock(731)) - Shutting down for restart (BP-203907574-10.0.1.17-1392096230619:blk_1073741825_1002). 2014-02-10 23:23:52,413 INFO datanode.DataNode (BlockReceiver.java:sendOOBResponse(977)) - Sending an out of band ack of type OOB_TYPE1 [Upstream Datanode] 2014-02-10 23:23:52,413 INFO datanode.DataNode (BlockReceiver.java:run(1060)) - Relaying an out of band ack of type OOB_TYPE1 [Client] 2014-02-10 23:23:52,414 WARN hdfs.DFSClient (DFSOutputStream.java:run(784)) - DFSOutputStream ResponseProcessor exception for block BP-203907574-10.0.1.17-1392096230619:blk_1073741825_1002 java.io.IOException: Bad response OOB_TYPE1 for block BP-203907574-10.0.1.17-1392096230619:blk_1073741825_1002 from datanode 127.0.0.1:55182 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:732)
          Hide
          Kihwal Lee added a comment -

          This jira depends on HDFS-5585. I will post a patch, which applies on top of HDFS-5585.

          Show
          Kihwal Lee added a comment - This jira depends on HDFS-5585 . I will post a patch, which applies on top of HDFS-5585 .
          Hide
          Kihwal Lee added a comment -

          The client-side logic will be done in HDFS-5924. If the client-side change is missing, the OOB ack will simply treated as an error by clients.

          Show
          Kihwal Lee added a comment - The client-side logic will be done in HDFS-5924 . If the client-side change is missing, the OOB ack will simply treated as an error by clients.

            People

            • Assignee:
              Kihwal Lee
              Reporter:
              Kihwal Lee
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development