Hadoop HDFS › HDFS-1623 (High Availability Framework for HDFS NN) › HDFS-1971

HA: Send block report from datanode to both active and standby namenodes

    Details

    • Hadoop Flags:
      Reviewed

      Description

      To enable a hot standby namenode, the standby node must have current information for both the namenode state (image + edits) and block location information. This jira addresses keeping the block location information current in the standby node. To do this, the proposed solution is to send block reports from the datanodes to both the active and the standby namenode.
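      The mechanism described above can be sketched in miniature. This is a minimal sketch with hypothetical names (NamenodeProxy, reportToAll, and so on), not the actual Hadoop datanode code; it only illustrates the idea of offering one identical block report to every configured namenode so the standby's block map stays as current as the active's:

```java
import java.util.Arrays;
import java.util.List;

public class DualBlockReportSketch {

    // Stand-in for the RPC proxy a datanode holds per namenode (hypothetical).
    interface NamenodeProxy {
        void blockReport(List<Long> blockIds);
    }

    // Test double that records the last report it received.
    static class RecordingProxy implements NamenodeProxy {
        List<Long> lastReport;
        @Override
        public void blockReport(List<Long> blockIds) {
            lastReport = blockIds;
        }
    }

    // Offer the same block report to each namenode, active and standby alike.
    static void reportToAll(List<NamenodeProxy> namenodes, List<Long> blockIds) {
        for (NamenodeProxy nn : namenodes) {
            nn.blockReport(blockIds);
        }
    }

    public static void main(String[] args) {
        RecordingProxy active = new RecordingProxy();
        RecordingProxy standby = new RecordingProxy();
        reportToAll(Arrays.asList(active, standby), Arrays.asList(1L, 2L, 3L));
        System.out.println(active.lastReport.equals(standby.lastReport)); // true
    }
}
```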

      Attachments

      1. daulBr1.patch
        1.28 MB
        Sanjay Radia
      2. DualBlockReports.pdf
        151 kB
        Sanjay Radia
      3. dualBr2.patch
        12 kB
        Sanjay Radia
      4. dualBr3.patch
        62 kB
        Sanjay Radia
      5. dualbr4.txt
        85 kB
        Todd Lipcon
      6. dualbr5.txt
        85 kB
        Todd Lipcon
      7. dualBr6.patch
        85 kB
        Sanjay Radia

        Issue Links

          Activity

          Uma Maheswara Rao G added a comment -

          Hi Sanjay,
          Have you started working on this? If not, can I take this issue? We have already implemented this mechanism in our cluster,
          so we would be happy to contribute our efforts.

          Sanjay Radia added a comment -

          I have started to work on this and will post the design in a couple of days followed by the patch.

          Sanjay Radia added a comment -

          Document describing issues and solutions is attached.

          Aaron T. Myers added a comment -

          Hey Sanjay, thanks a lot for posting this doc. Do you have any sense of which approach you'll go with? My inclination is probably approach 1, since it seems easiest and the disadvantages seem negligible.

          Eli Collins added a comment -

          The 1st paragraph assumes parallel is faster; is that true? An alternative solution is that a DN reports to the standby after it has reported to the primary. This should allow the cluster to start up more quickly. The downside here of course is that you can't fail over to the standby until after the 2nd report has finished, but that's probably not a big deal. A disadvantage of this approach is that it lets the standby get out of sync, but we have to handle an unsynchronized standby anyway (eg if the standby is brought up after the primary or restarted while the primary is running).

          The primary could also forward BRs to the standby, but I agree that we shouldn't pursue this approach, as the implementation will be more complex and it unnecessarily restricts the potential parallelism (though I'm not sure it is actually slower; you could potentially transmit much less information over the network if you report from the primary to the standby). It also makes supporting multiple standbys more difficult.

          I like solution #1. Aside from the simplicity, I think preventing a scan of all the DN disks is important; otherwise restarting the standby in a busy cluster will impact DN performance. You could also easily implement the above optimization of delaying the BR to the standby. 100M blocks seems low, eg a cluster with 4K hosts, 12 x 3TB drives/host, and 256MB blocks is ~580M total blocks. However, that's still < 10MB/host so I think it's OK.

          Uma Maheswara Rao G added a comment -

          Hi Sanjay,

          Thanks for posting the document!
          Solution 1 looks good and simpler.

          Here I have some questions.
          Consider this scenario: a DN contains a block whose file has not yet been closed by the client, so that block's information may not be available in the standby node. When we send the block report from the DN to the standby, the standby may issue an invalidate command to the DN, since that block's information is not available in the standby. When the DN receives the invalidate command from the standby node, what is the action in the DN?
          Will we ignore block invalidate commands from the standby node?

          --Thanks

          Sanjay Radia added a comment -

          @Uma describes a situation where the standby sends a command to a DN when it receives block info for a file under construction before reading the editsLog from the active NN for that file.

          The design posted in HDFS-1623 requires that the DN accept commands only from the Active. Further, an NN in the standby state will not issue commands to any DN.

          However, your scenario is a good one: a standby processing BRs from DNs and edits from the active NN may get them out of order.
          This scenario applies to HDFS-1975, which reads from a shared editsLog file system, and also to the Backup NN.

          Sanjay Radia added a comment -

          I am planning to use option 3 since it is not that much more complicated than 1.

          Eli Collins added a comment -

          Doesn't that still have the downside of double-scanning DN disks? Is it worth re-scanning all DN drives to save 2.5 MB?

          Sanjay Radia added a comment -

          Partial early patch - not complete.

          Sanjay Radia added a comment -

          I have attached a partial patch.

          • I have created a BPServiceToStandby class as a subclass of the BPOfferService class. However, I think I need to revisit this. The reason is that when NN failover occurs, the thread that sends the BRs to the standby should perform the same function as the main BPOfferService thread; we don't want to start a new BPOfferService since it would have to re-register and send new block reports. Hence I am thinking of simply adding a state to BPOfferService. Will post an updated patch.
          • The block-additions-deletions are passed from the primary BP thread to the secondary BP thread.
          • The patch generates the BRs again. The code can be easily modified to optimize for option 1 or option 3.
            • @eli - double-scanning DN disks ... Note for BRs we do not scan the disks but instead generate the BR from the Block-map data structure.
          Todd Lipcon added a comment -

          Hi Sanjay. It looks like the base of your diff might be the wrong branch - it includes a bunch of changes that aren't related to this JIRA. Can you reformat the diff?

          Sanjay Radia added a comment -

          Sorry, will do. I am making some other changes and will post an updated patch shortly.

          Sanjay Radia added a comment -

          Sorry, I had generated the diff against trunk instead of the hdfs-1623 branch.

          However, I am going to post a significantly different patch very shortly that does not subclass BPOfferService; instead, a BPOfferService can be in one of two roles: offer to Active or offer to Standby. This will allow us to change the role when NN failover occurs.
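          A minimal sketch of this role-switching idea, assuming hypothetical names (NamenodeRole, BPServiceSketch, switchRole) only loosely modeled on the real BPOfferService. The point is that failover flips a role on the existing service object, so the thread keeps its registration and block-report state rather than starting over:

```java
public class BPOfferServiceRoleSketch {

    enum NamenodeRole { ACTIVE, STANDBY }

    static class BPServiceSketch {
        private NamenodeRole role;

        BPServiceSketch(NamenodeRole role) {
            this.role = role;
        }

        // On failover the same service instance just changes role; no
        // re-registration or fresh full block report is required.
        void switchRole(NamenodeRole newRole) {
            this.role = newRole;
        }

        // Per the HDFS-1623 design, a DN acts on commands only from the Active.
        boolean shouldActOnCommands() {
            return role == NamenodeRole.ACTIVE;
        }
    }

    public static void main(String[] args) {
        BPServiceSketch svc = new BPServiceSketch(NamenodeRole.STANDBY);
        System.out.println(svc.shouldActOnCommands()); // false
        svc.switchRole(NamenodeRole.ACTIVE);           // failover occurs
        System.out.println(svc.shouldActOnCommands()); // true
    }
}
```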

          Sanjay Radia added a comment -

          Updated patch. Still a draft but good enough to read and get feedback.

          • Block report additions and deletions are given to both BPService threads (for active and standby)
          • BRs as per option 2: are created for each thread separately (note the BR is created from the in-memory BMap). Optimization 1 or 3 is under consideration.
          Todd Lipcon added a comment -

          Hey Sanjay. I like the architecture of this patch - makes sense to separate into the different "Actor" objects. I have a number of small nits/improvements to suggest, but I can wait until you have a more final patch. Or, if you like, I can do an iteration on top of the patch you've posted here and re-post by end of day today with suggested changes.

          Todd Lipcon added a comment -

          Only non-nitty thing I would suggest: rather than having just a bpServiceToActive and bpServiceToStandby, could we consider having:

          List<BPServiceActor> bpServiceActors;
          BPServiceActor bpServiceToActive; // reference to one of the above NNs
          

          then we can change a lot of the code to loop over bpServiceActors rather than having two separate calls to active and standby.

          Although we don't plan to support multiple-standby in today's implementation, I don't think the above introduces any extra complexity, and in fact will make the code cleaner, while also paving the way for a multiple-standby implementation in the future.
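          A toy illustration of the suggested structure; only the two field declarations echo the snippet above, and everything else (sendHeartbeat, heartbeatAll) is a hypothetical simplification rather than the real BPServiceActor API:

```java
import java.util.ArrayList;
import java.util.List;

public class ActorListSketch {

    // Hypothetical stand-in for the real BPServiceActor.
    static class BPServiceActor {
        final String nnAddress;
        int heartbeats = 0;
        BPServiceActor(String nnAddress) { this.nnAddress = nnAddress; }
        void sendHeartbeat() { heartbeats++; }
    }

    final List<BPServiceActor> bpServiceActors = new ArrayList<>();
    BPServiceActor bpServiceToActive; // reference to one of the actors above

    // Looping over the list rather than naming active/standby explicitly
    // naturally accommodates any number of standbys later.
    void heartbeatAll() {
        for (BPServiceActor actor : bpServiceActors) {
            actor.sendHeartbeat();
        }
    }

    public static void main(String[] args) {
        ActorListSketch dn = new ActorListSketch();
        BPServiceActor nn1 = new BPServiceActor("nn1.example.com:8020");
        BPServiceActor nn2 = new BPServiceActor("nn2.example.com:8020");
        dn.bpServiceActors.add(nn1);
        dn.bpServiceActors.add(nn2);
        dn.bpServiceToActive = nn1; // whichever NN is currently active
        dn.heartbeatAll();
        System.out.println(nn1.heartbeats + nn2.heartbeats); // 2
    }
}
```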

          Todd Lipcon added a comment -

          I rebased Sanjay's original patch onto the cleaned up tip of branch and also added some unit tests.

          Most of the tests pass at this point.

          One major "TODO" area which causes test failures is around refreshNameNodes() - I intend to resurrect that after HDFS-2582.

          Todd Lipcon added a comment -

          This patch just has a few small cleanups from dualbr4.patch.

          It still leaves the refreshNamenodes bit TODO, and adds a few more TODOs where we have to sort out what to do in terms of HA. I think it's worth committing this now, even with the TODOs, so we can address the TODOs in parallel. It will unblock other work and allow us to start figuring out more subtle issues.

          I can file follow-up JIRAs for each of the TODOs when this is committed.

          I've been testing this patch on top of HDFS-2582 and HDFS-2591 in that order, though I think it only strictly depends on HDFS-2582

          Sanjay Radia added a comment -

          I am reviewing the changes you made and have some additional minor changes; will post an updated patch later today or early tomorrow.
          Here is my feedback on your changes so far (more to come later today).
          1) DatanodeRegistration getDNRegistrationByMachineName(String mName)
          The TODO should be a separate jira - not related to this jira or HA.

          2) public String getNamenodeAddresses() {
          Why did you remove the bpos.isAlive() check that I had added: if (bpos != null && bpos.isAlive()) {
          The trunk code has: if (bpos != null && bpos.bpThread != null) {

          3) public void reportRemoteBadBlock(DatanodeInfo srcDataNode, ExtendedBlock block) does not belong here and not much to do with HA. Separate Jira and discussion.

          Todd Lipcon added a comment -

          I'll file a separate JIRA for #1 above.

          Let me look into why I made that change with regard to #2

          Regarding #3, the reasoning was that we used to grab references to a specific DatanodeProtocol proxy in BlockReceiver#verifyChunks. Now, we need to report that bad block to both NNs. So, it made more sense to add this new method to DataNode to pass the bad block info through to the correct BPOS, matching what we do for blockReceived, etc.

          Todd Lipcon added a comment -

          Why did you remove the bpos.isAlive() check that I had added: if (bpos != null && bpos.isAlive()) {

          Since this function appears to just be used for monitoring, I think it actually can be improved to return a deeper map, like:

          [ {blockPoolId: "Unknown",
             namenodes: [
               { address: "1.2.3.4:8020",
                 mode: CONNECTING },
               { address: "1.2.3.5:8020",
                 mode: CONNECTING },
              ]},
             {blockPoolId: "BP-12345",
             namenodes: [
               { address: "1.2.3.6:8020",
                 mode: ACTIVE },
               { address: "1.2.3.7:8020",
                 mode: STANDBY },
              ]}
          ]
          

          ... ie we'd include NN information even while they're still trying to handshake (and don't know their BP ID). If that makes sense I'll file another JIRA to do this.

          DatanodeRegistration getDNRegistrationByMachineName

          I filed HDFS-2609 for this

          Sanjay Radia added a comment -

          Went over the changes Todd made to my original patch. I like Todd's change to an array of actors. Summary of the reasons for changing to actors:

          • Both NNs may be Standbys during startup and also at other times
          • The design can support multiple Standbys and even Active-Active.

          I am attaching an updated patch with the following changes:

          1. Went back to my original impl for notifyNamenodeReceivedBlock() and notifyNamenodeDeletedBlock() so that ReceivedDeletedBlockInfo is allocated in BPOfferService and then passed to the actors, allowing the actors to share the object.
          2. void triggerBlockReportForTests() minor cleanup
          3. minor javadoc improvements
          Sanjay Radia added a comment -

          Updated patch.
          Todd I can commit this if you are okay with the changes I made.

          Todd Lipcon added a comment -

          Great, I'll take a look momentarily. Thanks Sanjay!

          Todd Lipcon added a comment -

          Looked over the delta. Good improvements, thanks for the review. +1 to commit!

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12505672/dualBr6.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 11 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1616//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          The Hadoop QA bot can't test this out since it's on a branch. But, I think it's best to just commit so we can move fast on the branch, and we'll make sure that the merge at the end passes QA and doesn't introduce findbugs, etc.

          Suresh Srinivas added a comment -

          I committed the patch.


            People

            • Assignee:
              Sanjay Radia
              Reporter:
              Suresh Srinivas
            • Votes:
              0
              Watchers:
              17
