Hadoop HDFS / HDFS-839

The NameNode should forward block reports to BackupNode

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

      The BackupNode (via HADOOP-4539) receives a stream of transactions from the NameNode. However, the BackupNode does not have the locations of blocks. It would be nice if the NameNode could forward all block reports (that it receives from DataNodes) to the BackupNode.

        Issue Links

          Activity

          shv Konstantin Shvachko added a comment -

          Eli and Todd asked for the notes from the meeting on HA. I'll post them here in order to avoid recursive requests.

          My understanding of the current state of the HA issue is that there are a lot of unanswered questions and problems.
          Passing block locations to BN raised by Wang in this issue is one of them.
          Todd mentions the problem of lease recovery: how the recovery of unfinished writes is done in the HA world.
          Another problem mentioned by Dhruba is the race condition in removing replicas of a block when control switches from primary to backup, as both of them can decide to remove (potentially different) replicas of the same block.
          If you know other problems please share.

          I see there is a big interest in HA from many people / organizations.
          Unfortunately, I don't see a real proposal that attempts to answer these and other potential problems.

          In my personal opinion we should do HA in 3-4 steps:
          1. Start from the relatively simple task of implementing a manual warm standby (see above for classification). This can be implemented in 2-4 weeks.
          2a, 2b. From there HDFS can evolve in either direction: automatic warm standby, or manual hot standby. The latter (2b) is Facebook's priority as I understand.
          3. After that the last two (2a+2b) technologies can be merged into automatic hot standby.

          gnawux Wang Xu added a comment -

          I agree with Todd and Eli.

          Even manual fail-over requires a suite of operation interfaces, which could also be invoked by external mature HA tools.

          As a trade-off between consistency and performance, the block-related info could be forwarded to the BN in non-blocking manners:

          • From the DN or even the client to the BN directly; or
          • Forwarded from the NN to the BN asynchronously.

          And I think the selection between the above two, together with forwarding info in the EditStream, is the discussion in this issue.
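The second option above (asynchronous forwarding from the NN to the BN) can be sketched as follows. This is a minimal illustration, not HDFS code: the names `BlockReportForwarder` and `BackupNodeClient` are hypothetical. The idea is that the NN enqueues a report in O(1) while holding its lock, and a background thread drains the queue and makes the network call to the BN with no lock held.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: the NN enqueues block reports and a background
 *  thread forwards them to the BN outside the namesystem lock. */
public class BlockReportForwarder implements Runnable {

    /** Stand-in for the RPC proxy a real NN would hold to the BN. */
    public interface BackupNodeClient {
        void forwardBlockReport(String datanodeId, List<Long> blockIds);
    }

    private static class Report {
        final String datanodeId;
        final List<Long> blockIds;
        Report(String d, List<Long> b) { datanodeId = d; blockIds = b; }
    }

    private final BlockingQueue<Report> queue = new LinkedBlockingQueue<>();
    private final BackupNodeClient backupNode;

    public BlockReportForwarder(BackupNodeClient backupNode) {
        this.backupNode = backupNode;
    }

    /** Called while the NN holds its lock: O(1), never touches the network. */
    public void enqueue(String datanodeId, List<Long> blockIds) {
        queue.add(new Report(datanodeId, blockIds));
    }

    /** Forwards a single queued report to the BN; blocks until one is available. */
    public void forwardOne() throws InterruptedException {
        Report r = queue.take();
        backupNode.forwardBlockReport(r.datanodeId, r.blockIds);
    }

    /** Drains the queue forever on a dedicated thread, with no NN lock held. */
    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                forwardOne();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The unbounded queue here trades memory for completeness; the trade-off discussed later in this thread (dropping under memory pressure) would replace it with a bounded one.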

          eli Eli Collins added a comment -

          I think we should attempt to keep whether the fail-over is manual or automatic largely orthogonal to HDFS, i.e. we need to provide the necessary interfaces to allow external software to do either manual or automatic fail-over, and rely on that software to drive the fail-over. In other words, let's leverage existing software where we can.

          My understanding is that this jira is about enabling a faster fail-over (either manual or automatic) from the NN to the BN by actively syncing the necessary state between the two. Seems like the next step is to identify the set of jiras we need to enable this beyond forwarding the block report: e.g. have the BN maintain an up-to-date edits log, reconstruct leases, etc. Reasonable?

          tlipcon Todd Lipcon added a comment -

          Does it make sense?

          Absolutely. In particular, I think the automatic standby modes should be punted to external tools for the initial implementation. There are a lot of good tools for this, and as long as the manual fail-over modes are scriptable and reliable (read: automatically tested) we should feel comfortable using LinuxHA, ZK, or any other failure detector to trigger the fail-over.

          As I understand it, we have had "cold HA" for quite some time already. The BackupNode in 21 adds "warm standby". This JIRA is starting discussion about "hot standby".

          Is there a temperature in between warm and hot where the standby does not have block reports, but when a fail-over occurs, the DNs are instructed to report immediately? Thus, the fail-over would not have to wait for an entire block report interval to be operational.
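That in-between temperature could be sketched roughly as follows. All names here are hypothetical, not the real DN protocol: on takeover, the standby answers each data-node's first heartbeat with a command to send its block report immediately, instead of waiting a full report interval.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a standby that keeps no block map, but on takeover
 *  tells every data-node (via its first heartbeat reply) to report at once. */
public class FailoverHeartbeatHandler {

    public enum HeartbeatCommand { NORMAL, SEND_BLOCK_REPORT_NOW }

    private final Set<String> reported = new HashSet<>();
    private boolean active = false;

    /** Called when the admin (or a failure detector) promotes this node. */
    public synchronized void becomeActive() { active = true; }

    /** Returns the command sent back in the heartbeat reply for this DN. */
    public synchronized HeartbeatCommand onHeartbeat(String datanodeId) {
        if (active && !reported.contains(datanodeId)) {
            return HeartbeatCommand.SEND_BLOCK_REPORT_NOW;
        }
        return HeartbeatCommand.NORMAL;
    }

    /** Called when the DN's triggered block report arrives. */
    public synchronized void onBlockReport(String datanodeId) {
        reported.add(datanodeId);
    }
}
```

With this scheme the fail-over latency is bounded by how fast DNs can generate and send reports, rather than by the periodic report interval.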

          shv Konstantin Shvachko added a comment -

          Wang> the information which BN has, decides the latency of fail-over.

          Correct. And this is the difference between warm standby and hot standby. By warm standby I mean a BN which has only namespace information, without block locations. So in order to start up it will wait for block reports from data-nodes that are switching to the BN as the new primary. Waiting for all block reports can take time, which depends on the cluster size (could be 10-20 minutes on large clusters).
          But it is still faster than NN cold startup when it needs to read the image and edits before processing the block reports.
          With hot standby you will need BN to have all information the primary NN has.

          Following yesterday's discussion with folks from Facebook and Y! I want to clarify the classification of HA solutions in my previous comment.

          1. Manual or automatic cold HA. Start a new NN from scratch when the old one dies. Does not need a BN. Does not need changes to HDFS code if external tools are used for failure detection and restart. Suresh experimented with Linux HA, and Cloudera has a write-up on that.
          2. Manual warm standby. An admin command is issued to switch to the BN. The BN does not have block locations - only the namespace.
          3. Automatic warm standby. Cluster components automatically switch to the BN when the primary NN dies. The BN does not have block locations.
          4. Manual hot standby. An admin command is issued to switch to the BN. The BN has maximum information to take over.
          5. Automatic hot standby. The cluster automatically switches to the BN. The BN has maximum information to take over.

          I am arguing that we should not jump straight to automatic-hot-standby because it's a hard problem, but rather do it step by step, starting from manual-warm-standby. I am sure there will be a substantial amount of work at each step. Lease recovery (as Todd proposes) and other requirements can be additional constraints defining how warm or how hot we want the standby node to be.
          Does it make sense?

          gnawux Wang Xu added a comment -

          Konstantin: I think the difference between manual and automatic fail-over is not only the operating procedure, but also the information consistency. Actually, if we have an interface for the administrator to change the BN/NN's role and a monitoring method, #1 could be done automatically with some external facility such as HA software. However, the information which the BN has decides the latency of fail-over.

          Thus, should we also decide which information BN should have for different HA requirement? Such as:

          1. Persistent data: namespace information, which has been logged by the Backup Node; with an administrative command, the BN could switch its role to NN.
          2. Run-time file status information: lease information, which could help the service continue during an NN switch.
          3. Block distribution and replication information: which could accelerate the fail-over procedure.

          Data of different priority should have different synchronization policies, and if we forward the less significant information asynchronously, it could avoid holding the lock.
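These three categories with per-category sync policies might look like the following table in code. The class, enum, and policy names are illustrative assumptions, not HDFS code; the point is only that namespace edits need strict ordering while block locations, which can always be rebuilt from DN reports, tolerate best-effort delivery.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch: each category of state the BN needs gets its own
 *  synchronization policy, so low-priority data never holds the NN lock. */
public class SyncPolicyTable {

    public enum Category { NAMESPACE, LEASES, BLOCK_LOCATIONS }
    public enum Policy { SYNCHRONOUS, ASYNC_ORDERED, ASYNC_BEST_EFFORT }

    private final Map<Category, Policy> policies = new EnumMap<>(Category.class);

    public SyncPolicyTable() {
        // Namespace edits must reach the BN in order (it replays the edit log).
        policies.put(Category.NAMESPACE, Policy.SYNCHRONOUS);
        // Leases matter for write continuity but can trail slightly behind.
        policies.put(Category.LEASES, Policy.ASYNC_ORDERED);
        // Block locations can be rebuilt from DN reports; losing some forwarded
        // updates only lengthens fail-over, it never corrupts the namespace.
        policies.put(Category.BLOCK_LOCATIONS, Policy.ASYNC_BEST_EFFORT);
    }

    public Policy policyFor(Category c) { return policies.get(c); }
}
```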

          tlipcon Todd Lipcon added a comment -

          I think before Konstantin's #1 failover, there's a #0.5 – the lease recovery of current writers is very nice to have, but there's some use for manual fast failover without this functionality.

          dhruba dhruba borthakur added a comment -

          I am interested in doing (1) from Konstantin's list along with the following addition:

          1. If the primary NN dies (unscheduled downtime), the administrator should be able to issue a command to the BN to change its role to be the primary NN.

          shv Konstantin Shvachko added a comment -

          I agree with Eli that block report handling is just a sub-task of the general design.

          I am not sure the hot standby solution should be rushed in one single step. Creating the BackupNode, I thought we would go through 3 stages:

          1. Manual fail-over to BN. You start a BackupNode and issue an admin command to the name-node to switch to the BN. The name-node enters safe mode, and replies to all data-nodes' heartbeats that they should switch to the BN. When the name-node does not have any more live data-nodes, you can safely turn it off. Similarly for active clients: the NN asks them to connect to the BN when they do ls or apply for a lease extension. This is useful when you need to take the NN down for a scheduled service or hardware upgrade.
          2. Automatic fail-over to BN. This will be the warm standby.
            Cluster components a priori know where to switch in case of failure. And if the NN fails they switch to the BN. This has a wide range of possible solutions, and may need a lot of work. With warm standby you still have down-time measured in tens of minutes, but it is useful for both scheduled and unscheduled NN shutdowns.
          3. Hot standby. This is a good thing to have. My intuition is that most people will be well satisfied with the warm solution.

          I was wondering: is there any interest in doing (1)?

          eli Eli Collins added a comment -

          We should really figure out the requirements and specifics of the feature we're after before addressing details like how we handle block reports.

          Whether the NN or the DNs forward block reports to the BN(s) should be considered in the context of the overall design. E.g., above, people have been making assumptions (e.g. that the NN has to establish a pipeline to the BN(s) to forward block reports) that may or may not hold.

          shv Konstantin Shvachko added a comment -

          I want to back up Dhruba: passing each and every data-node op through the name-node will substantially increase load on the name-node, and holding the lock longer will reduce its productivity.
          Another aspect here is that the BN is also a NN. So if you supply it with block locations it will leave safe mode and start playing a role in block replication/deletion, lease recovery, etc., along with the name-node. If a data-node dies, both the NN and the BN will start replicating its blocks unless you control it. So are we solving the split-brain problem here? You can of course forbid replication when the NN is in the Backup role, but you will have a hard time convincing e.g. Allen that the solution is bullet-proof and bug-free.

          gnawux Wang Xu added a comment -

          todd, dhruba: in our practice, the duration of the global lock does impact performance, though it is insignificant unless there are many concurrent pure namespace ops such as touchz or mkdir.

          And I think we can assign different sync priorities per consistency requirement; block info might be synced asynchronously.

          dhruba dhruba borthakur added a comment -

          > the all-through-the-NN solution can simply start dropping if the queue hits some memory threshold

          precisely. There isn't much consistency needed for block reports and blocks received. (But the more up to date the BackupNode is regarding these reports, the faster the failover.) And since consistency between the NN and BN is not required for block reports and blocks received, why burden the NN with these additional tasks? It is better to make the DN route these messages to the BN directly, isn't it?

          tlipcon Todd Lipcon added a comment -

          I see your point about additional memory pressure for queueing up the messages to be sent.

          I think to resolve this issue we should lay out what consistency and reliability we need. Specifically, are lost block reports OK? If so, then the all-through-the-NN solution can simply start dropping if the queue hits some memory threshold, right?
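The drop-on-threshold idea can be sketched with a fixed-capacity queue whose non-blocking offer simply counts a drop when full, so the NN neither blocks under the global lock nor grows its heap. `LossyForwardQueue` is a hypothetical name, not HDFS code; it assumes lost block-report forwards are acceptable (the BN catches up from later reports).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: messages bound for the BN go into a fixed-capacity
 *  queue; when it is full the NN drops the message (and counts the drop)
 *  instead of blocking or consuming unbounded memory. */
public class LossyForwardQueue<T> {

    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    public LossyForwardQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking enqueue: returns false and counts a drop when full. */
    public boolean offer(T message) {
        boolean accepted = queue.offer(message);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Called by the forwarder thread; returns null when empty. */
    public T poll() { return queue.poll(); }

    /** Exposed so operators can see how lossy the forwarding actually is. */
    public long droppedCount() { return dropped.get(); }
}
```

Exposing the drop counter matters: if it stays near zero in practice, lossiness is a non-issue; if it grows, the capacity (the "memory threshold") needs tuning.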

          dhruba dhruba borthakur added a comment -

          Todd: I agree that CPU is not what I am concerned about. The problem is two-fold: if the NN decided to forward each and every message to the backupnode (as soon as it received it), then we have the problem of keeping the global lock for longer durations of time. If we buffer these messages to be sent by another thread asynchronously, then we put additional memory pressure on the namenode. Do you agree?

          tlipcon Todd Lipcon added a comment -

          Dhruba: regarding NN performance, do you often find that CPU or network are limiting factors on your NNs? In my experience, the bottlenecks on the NN are (a) RAM, and (b) the coarse grained FSNamesystem synchronization. That is to say, CPU is only a bottleneck in that it can't begin to saturate all the cores of a multicore system.

          Given that, I imagine that the overhead of the NN having to forward block reports, etc, over to even several BNs should be fairly low. It also short circuits a bunch of tricky consistency questions if the DNs are reporting to multiple places. (I'm assuming that all the BN synchronization work can be deferred to other threads and shouldn't put any additional load on the FSNamesystem lock)

          gnawux Wang Xu added a comment -

          >> However, if it only synchronizes block reports or not sync all heartbeats,
          >
          > If we do this, then it will have some impact on the failover times, right? At the time of failover, the backupnode will not have all the information it needs; it will have to gather those missing pieces of information before it can start servicing request, isn't it?

          I think all block-related information should be forwarded to the BN, and the only difference between us is how the info gets to the BN, isn't it?

          >> You could dedicate a link for NN to backup node communication to improv
          >
          > if all communication from datanodes to backupnodes flow via the NN, then in the case of NN death, the backup nodes might not have all the latest-and-greatest-data that was sent from the datanodes, isn't it?
          >

          BN should keep up with NN, otherwise the failover cannot be seamless.

          gnawux Wang Xu added a comment -

          For multiple BNs, I think a pipeline is unacceptable performance-wise.

          And I agree that DNs reporting directly could have better performance than the design where the NN syncs to the BN. But will the under-replicated/invalid block status matter?

          And can we introduce multicast for the multiple-BN solution? It may decrease the network overhead. The issue is that multicast does not provide any guarantees.

          dhruba dhruba borthakur added a comment -

          > However, if it only synchronizes block reports or not sync all heartbeats,

          If we do this, then it will have some impact on the failover times, right? At the time of failover, the backupnode will not have all the information it needs; it will have to gather those missing pieces of information before it can start servicing requests, isn't it?

          > You could dedicate a link for NN to backup node communication to improv

          if all communication from datanodes to backupnodes flows via the NN, then in the case of NN death, the backup nodes might not have all the latest-and-greatest data that was sent from the datanodes, isn't it?

          dhruba dhruba borthakur added a comment -

          > It seems like the network overhead of having each DN send its block report to each backup node on a large cluster would be higher than a stream from the NN to the backup node

          I agree. But since this traffic goes straight from the datanode(s) to the backupnode(s), it will be mostly equally distributed among all the datanodes. The alternative is that the NN has to stream all blockReceived messages to the backupnode, which could mean that you need namenode machines with greater horsepower. If the NN is streaming all blockReceived to one backupnode, it could still be fine, but if it has to stream to multiple backup nodes in parallel that would be quite unsettling for performance. On the other hand, if the namenode pipelines these blockReceived messages through a pipeline of backupNodes, then the namenode has to go through a complex procedure to handle errors (if any) from the n-th backupnode in the pipeline.
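          The fan-out alternative contrasted with the pipeline here can be sketched roughly as follows; all names (BackupNodeProtocol, BlockReportForwarder) are hypothetical illustrations, not actual HDFS code. The point is that independent parallel sends isolate a failed backup instead of requiring the pipeline's complex error recovery:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical interface -- not the real HDFS BackupNode RPC protocol.
interface BackupNodeProtocol {
  void blockReceived(String datanode, long blockId) throws Exception;
}

class BlockReportForwarder {
  private final List<BackupNodeProtocol> backups;
  private final ExecutorService pool;

  BlockReportForwarder(List<BackupNodeProtocol> backups) {
    this.backups = backups;
    this.pool = Executors.newFixedThreadPool(Math.max(1, backups.size()));
  }

  /** Fan out one blockReceived to every backup in parallel; a failure of the
   *  n-th backup does not stall the others, unlike a pipeline. Returns the
   *  number of backups that acknowledged. */
  int forward(String datanode, long blockId) throws InterruptedException {
    List<Future<Boolean>> acks = new ArrayList<>();
    for (BackupNodeProtocol bn : backups) {
      acks.add(pool.submit(() -> {
        try { bn.blockReceived(datanode, blockId); return true; }
        catch (Exception e) { return false; } // would mark this BN stale, not abort
      }));
    }
    int acked = 0;
    for (Future<Boolean> f : acks) {
      try { if (f.get()) acked++; } catch (ExecutionException e) { /* counts as a miss */ }
    }
    return acked;
  }

  void close() { pool.shutdown(); }
}
```

A real forwarder would also have to decide what "stale" means for a backup that missed messages (e.g. force a full resync), which is the hard part either way.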

          gnawux Wang Xu added a comment -

          @dhruba
          If the NN synchronizes all heartbeats to the BN in anything but a very small cluster, it will amount to a DDoS. However, if it only synchronizes block reports and not all heartbeats, it should increase the NN's load only slightly. After all, it already needs to synchronize namespace operations to the BN.

          And I also think the under-replicated list may be a problem if the NN and BN are managing blocks separately.

          eli Eli Collins added a comment -

          Forgot to mention: I suppose you could run a gossip protocol to advertise the block reports to address the network throughput issue, but that seems more complicated than providing high-speed access between the NN and the backup nodes.

          eli Eli Collins added a comment -

          > I am thinking that the namenode has to spend considerable horsepower to keep forwarding block reports and blocksReceived to BackupNode. I would like to have a HA solution without impacting the performance of the NN. Thus, the proposal to make datanodes send block reports and blocks received to NN as well as BN.

          Is the rate of block creation that high? If you have n backup nodes, it seems like the network overhead of having each DN send its block report to each backup node on a large cluster would be higher than a stream from the NN to the backup node. You could dedicate a link for NN-to-backup-node communication to improve throughput, but it's hard to do something similar for DNs all over the network.
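          The overhead comparison above can be made concrete with a back-of-envelope sketch; the cluster size, backup count, and report size below are assumed figures for illustration only, not measurements:

```java
// All numbers are illustrative assumptions, not measurements.
class ReportTraffic {
  /** Bytes a single datanode's uplink carries per reporting round when DNs
   *  report directly to every backup node. */
  static long perDatanodeLink(long backups, long reportBytes) {
    return backups * reportBytes;
  }

  /** Bytes the namenode's single link carries per round when the NN forwards
   *  every report to every backup node. */
  static long namenodeLink(long datanodes, long backups, long reportBytes) {
    return datanodes * backups * reportBytes;
  }

  public static void main(String[] args) {
    long datanodes = 4000, backups = 2, reportBytes = 1_000_000; // ~1 MB/report, assumed
    System.out.println(perDatanodeLink(backups, reportBytes));          // spread over DN links
    System.out.println(namenodeLink(datanodes, backups, reportBytes));  // one hot NN link
  }
}
```

The total bytes moved per round are identical in both schemes; the difference is whether they are spread across thousands of datanode links or concentrated on the namenode's one, which is exactly why a dedicated NN-to-BN link helps one scheme but not the other.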

          dhruba dhruba borthakur added a comment -

          > what's the rationale for doing so?

          I am thinking that the namenode has to spend considerable horsepower to keep forwarding block reports and blocksReceived to the BackupNode. I would like an HA solution that does not impact the performance of the NN. Hence the proposal to make datanodes send block reports and blockReceived messages to the NN as well as the BN.

          > the backup node can just re-replicate any blocks it thinks are under-replicated when it becomes the primary.

          Precisely.

          However, when we fail over from the NN to the BN, we should quickly determine whether there are any blocks in the BN that do not have any valid replicas, and then query all datanodes to send a block report for those blocks.

          eli Eli Collins added a comment -

          > I would rather make the datanode send block reports and block received messages directly to the master as well as the BackupNode. do you agree?

          One nice thing about having the namenode sync to the backup node is that there should be less divergence in state (e.g. the block maps), so you have an easier coherency problem between the primary and the backup. Can't think of why divergence would necessarily be hard to deal with though; e.g. the backup node can just re-replicate any blocks it thinks are under-replicated when it becomes the primary.

          Also wondering aloud whether it would make implementing snapshots easier if the namenode gets to update the backup node state, e.g. could do so atomically.

          Having the datanodes send block reports/received directly to the backup node seems like it could work as well; what's the rationale for doing so?

          gnawux Wang Xu added a comment -

          But this means:
          1. a DataNode needs to report to two NameNodes;
          2. a DataNode should handshake with the Backup Node before it reports to it.

          And other block changes should also be forwarded to the Backup Node, such as block corruption.
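          The two points above could be sketched as follows; ReportTarget and its methods are hypothetical names for illustration, not the real DataNode or BackupNode code:

```java
// Hypothetical sketch of the two consequences listed above: a DataNode keeps
// multiple report targets (NN and BN), and must handshake with each before
// it may send reports.
class ReportTarget {
  final String name;
  private boolean registered = false;

  ReportTarget(String name) { this.name = name; }

  /** Point 2: the DataNode must handshake (register) before reporting. */
  void handshake() { registered = true; }

  /** Point 1: the DataNode sends its report to each target it registered
   *  with. Returns false if the target refuses an unregistered reporter. */
  boolean sendBlockReport(long[] blockIds) {
    return registered; // a real implementation would RPC the report here
  }
}
```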

          dhruba dhruba borthakur added a comment -

          > Do you think these information should be sent to Backup Node in

          I would rather make the datanode send block reports and block received messages directly to the master as well as the BackupNode. Do you agree?

          gnawux Wang Xu added a comment -

          Hi Dhruba,

          Thanks for your reply

          On Thu, Jan 14, 2010 at 10:40 PM, dhruba borthakur (JIRA) <jira@apache.org> wrote:
          > 1. new transactions from clients are blocked when the primary namenode is syncing transactions to a new slave

          For those modifying metadata, yes, they are blocked.

          > 2. The automatic promotion of a slave to be a master (when the original master dies) based on zookeper is something for the future

          Yes.

          > 3. The datanodes will send block reports only to the master. so when a failover occurs you have to restart (or somehow tell) the datanodes to start sending block reports to the new master. This can increase failover times drastically.

          Information from the DataNodes, including block reports, corrupt blocks, and
          heartbeat information, is synchronized to the slaves, though heartbeats
          from all datanodes are collected and synchronized together once every 30
          seconds. Thus the datanodes need not be restarted.

          > 4. I think we will somehow have to handle the split brain scenario where there are two masters running on the same cluster. We have to prevent such a case.

          We use linux-HA/heartbeat cluster with at least 3 nodes to cover split brain.

          > 5. Your future section regarding Backupode looks great.

          The nearer to mainline, the better.

          > My thinking is that datanodes would have to send block reports/block received to all the masters. This reduces the number of masters you can have in your system. But it will make the failover times quick and fast. any thoughts?
          >

          I agree. And I think it is the practical step for the namenode to get to
          HA. Do you think this information should be sent to the Backup Node in
          the editlog stream?
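          The once-per-30-seconds heartbeat batching described above could look roughly like this; the class and method names are assumptions for illustration, not the actual patch:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the batching described above (structure assumed): heartbeats
// accumulate on the primary and are synchronized to the slave as one batch
// every 30 seconds instead of being forwarded one by one.
class HeartbeatBatcher {
  static final long SYNC_INTERVAL_MS = 30_000;
  private final Map<String, Long> latest = new HashMap<>(); // datanode -> last heartbeat time
  private long lastSyncMs = 0;

  void onHeartbeat(String datanode, long nowMs) {
    latest.put(datanode, nowMs); // only the newest heartbeat per DN matters
  }

  /** Returns the batch to push to the slave if 30s have elapsed, else null. */
  Map<String, Long> maybeSync(long nowMs) {
    if (nowMs - lastSyncMs < SYNC_INTERVAL_MS) return null;
    lastSyncMs = nowMs;
    Map<String, Long> batch = new HashMap<>(latest);
    latest.clear();
    return batch;
  }
}
```

Coalescing like this is what keeps the slave-sync traffic far below one message per heartbeat, at the cost of the slave's view lagging by up to the sync interval.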

          dhruba dhruba borthakur added a comment -

          Hi Wang, thanks for your document. It is very nicely written and easily understandable. Here are the notes I made while reading your document:

          1. New transactions from clients are blocked while the primary namenode is syncing transactions to a new slave.
          2. The automatic promotion of a slave to be a master (when the original master dies) based on ZooKeeper is something for the future.
          3. The datanodes will send block reports only to the master, so when a failover occurs you have to restart (or somehow tell) the datanodes to start sending block reports to the new master. This can increase failover times drastically.
          4. I think we will somehow have to handle the split-brain scenario where there are two masters running on the same cluster. We have to prevent such a case.
          5. Your future section regarding the BackupNode looks great.

          Thanks for the hard work. Let's continue our discussion here.

          My thinking is that datanodes would have to send block reports/block received to all the masters. This reduces the number of masters you can have in your system, but it will make failover quick. Any thoughts?

          gnawux Wang Xu added a comment -

          Hello dhruba,

          I am interested in this work, and we did a similar job before:
          http://gnawux.info/hadoop/2010/01/pratice-of-namenode-cluster-for-hdfs-ha/

          Could you tell me the progress of this work, and whether I could contribute
          anything to it?

          Thank you!

          Regards,
          Wang Xu


            People

            • Assignee:
              dhruba dhruba borthakur
              Reporter:
              dhruba dhruba borthakur
            • Votes:
              1
              Watchers:
              22
