Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      This is an umbrella JIRA for discussing the availability of the HDFS NN and providing references to other JIRAs that improve its availability. This includes, but is not limited to, automatic failover.

        Issue Links

          Activity

          Sanjay Radia added a comment -

          Some of the comments in the other JIRAs have suggested that Yahoo has been working only on scalability and not on availability. Both availability and scalability are important issues for us. Most folks equate availability with automatic failover, but there is more to availability than failover. The original purpose of HDFS was to support a batch processing system. This allowed one to rely on restart, since batch jobs can be delayed. However, the SLA requirements for batch jobs are getting tighter, and Hadoop is beginning to be used for near-online or online services.

          Below is some of the work that has happened to improve the availability of the NN and to move towards automatic failover. (Some of these changes are in release 0.20 and others in trunk.)

          • We have made a lot of progress in restarting an HDFS cluster. Two years ago, the restart time for a 2K-node cluster at Yahoo was several hours; one had to start 100 DNs at a time whenever the NN was rebooted. In trunk we have measured the restart time for a 3K-node cluster to be 30 minutes. Reducing restart time is important for failover: cold/warm failover performs all or part of the restart. Some of the steps we took:
            • reducing the time to load the fsimage and edit logs; you will see more of this in the next few months.
            • reducing the cost of a block report - the initial block reports are needed for the NN to start providing service. We can now safely restart the NN and handle 3K initial block reports in our clusters.
              Facebook's internal patch puts block reports and heartbeats on a separate port - I understand that this has helped the startup time.
          • A major step towards HA was adding the Backup NameNode, which synchronously receives the edit logs. This work needs to be extended to do an actual failover. We are exploring manual failover using this backup NN and later doing an automatic failover using ZooKeeper (a rough sketch of such an election follows below). There is also ongoing work on integrating BookKeeper with the NN. (I will explain the tradeoffs of the Backup NN vs. BookKeeper in a future comment.)
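
          To make the ZooKeeper idea above concrete, here is a minimal sketch of how such an active-NN election could work: each candidate NN races to create an ephemeral znode, the winner becomes active, and the standby watches that znode and takes over when the active's ZooKeeper session expires. This is not from any patch attached to this JIRA; the class name, znode path, and the becomeActive()/remainStandby() hooks are hypothetical, and fencing of the old active is deliberately not handled.

          // Hypothetical sketch, assuming the ZooKeeper Java client API; names and paths
          // are made up for illustration. Fencing of the old active NN is not handled.
          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          public class NNFailoverSketch implements Watcher {
            private static final String LOCK_PATH = "/hdfs-nn-active"; // hypothetical znode
            private final ZooKeeper zk;
            private final byte[] myAddress;

            public NNFailoverSketch(String zkQuorum, String nnAddress) throws Exception {
              // 30s session timeout; this object receives watch events via process().
              this.zk = new ZooKeeper(zkQuorum, 30000, this);
              this.myAddress = nnAddress.getBytes("UTF-8");
            }

            /** Race to become the active NN; otherwise stand by and watch the active. */
            public void tryToBecomeActive() throws KeeperException, InterruptedException {
              try {
                // The ephemeral znode disappears automatically if our ZK session dies.
                zk.create(LOCK_PATH, myAddress, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                becomeActive();                      // we won the race
              } catch (KeeperException.NodeExistsException e) {
                zk.exists(LOCK_PATH, true);          // lost the race: watch the active's znode
                remainStandby();
              }
            }

            @Override
            public void process(WatchedEvent event) {
              // Fired when the active's znode is deleted, i.e. its ZK session expired.
              if (event.getType() == Watcher.Event.EventType.NodeDeleted
                  && LOCK_PATH.equals(event.getPath())) {
                try {
                  tryToBecomeActive();
                } catch (Exception e) {
                  // Retry or give up according to policy.
                }
              }
            }

            private void becomeActive()  { /* finish applying edits, start serving clients */ }
            private void remainStandby() { /* keep consuming the shared edit log */ }
          }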
          Jeff Hammerbacher added a comment -

          Hey,

          Could someone comment on how the BackupNameNode (BNN) differs from the AvatarNode (AN) that Dhruba is proposing?

          Thanks,
          Jeff

          Jeff Hammerbacher added a comment -

          It's also probably worth linking to some of the DRBD-based work on HA that's been done elsewhere:
          http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability
          http://www.cloudera.com/blog/2009/07/hadoop-ha-configuration/
          Jeff Hammerbacher added a comment -

          Sorry, can't edit the above, but wanted to link to Dhruba's blog posts in case others had missed them:
          http://hadoopblog.blogspot.com/2009/11/hdfs-high-availability.html
          http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html

          Also, a research paper: http://portal.acm.org/citation.cfm?id=1651271

          dhruba borthakur added a comment -

          > Could someone comment on how the BackupNameNode (BNN) differs from the AvatarNode (AN) that Dhruba is proposing?

          The BackupNameNode (as it stands now) consumes the transaction logs from the primary NN, but it does not have block locations associated with each block. Thus, the failover time could be large. The AvatarNode is a hot standby, and I have measured failover times (with 10 million files) of 10 seconds in a test cluster. (But, of course, the BackupNode could be extended in the future to support a hot standby.)

          Jeff Hammerbacher added a comment -

          There is some documentation on how to use the UpRight library to achieve high availability for HDFS at http://code.google.com/p/upright/wiki/HDFSUpRightOverview

          André Oriani added a comment -

          Just complementing Jeff's comment: the UpRight library paper is at http://www.sigops.org/sosp/sosp09/papers/clement-sosp09.pdf

          Daniel Einspanjer added a comment -

          I've noticed that the current recommended configuration and the two HDFS HA strategies I've seen all involve writing the NN data to NFS shared storage. This still seems to be a single point of failure to me. If the NFS becomes unreachable or the NFS filer must be taken down for maintenance, where does that leave the cluster?

          I was recently investigating the possibility of using RHEL GFS or something similar to provide a highly available storage location for the NN data. It was suggested that I not take this path, so I'm just trying to understand why it isn't necessary.

          Eli Collins added a comment -

          Hey Daniel,

          Failing the namenode over from one host to another that also has access to the latest image and edit log requires shared storage. Whether or not this shared storage is a single point of failure is orthogonal to whether the storage is made available via NFS or a LUN (used to host a clustered file system). Many storage systems use redundant disks, power, network interfaces, multi-path I/O, etc. to make the shared storage they export highly available; the underlying storage system, more than the protocol used to access it, will determine your availability.

          Note that for future releases the shared storage doesn't strictly need to be as or more available than the namenode host: there's been work on trunk to handle flaky dfs.name.dir directories (a sample dfs.name.dir configuration is sketched below).

          Thanks,
          Eli
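
          As a concrete reference for the shared-storage setup discussed above, here is a minimal hdfs-site.xml sketch, assuming a 0.20-era configuration; the paths are illustrative, not a recommendation. dfs.name.dir takes a comma-separated list of directories, and the NN writes the image and edit log redundantly to each one, so a standby host that mounts the same NFS export can pick up the latest metadata.

          <!-- Hypothetical hdfs-site.xml fragment; paths are illustrative. -->
          <property>
            <name>dfs.name.dir</name>
            <value>/data/1/dfs/name,/mnt/filer/dfs/name</value>
          </property>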

          Sanjay Radia added a comment -

          Would users find the use of HA NFS in a failover solution to be a showstopper? I agree that it is somewhat embarrassing to say that HDFS failover depends on HA NFS. The reason I ask is that HA NFS as shared storage is one of the fastest ways for us to develop an HA solution.
          Q. Do users already have an NFS server that they can use for this purpose? For example, at Yahoo we use NFS as one of several "disks" for the edits and image.

          I don't see this as a final solution but merely a first step. A shared dual-ported disk solution will require more work, especially for storage fencing. Using the BackupNN is, I suspect, also a bit more complicated than using shared storage.
          (BTW, as noted above, the AvatarNN uses NFS as part of its controlled manual failover during an upgrade.)

          stack added a comment -

          Shared storage as the underpinning of an HA NN strikes me as an odd approach. In the past, when making a service HA, the first thing I'd do was remove all dependencies on NFS.

          What is the thinking regarding bootstrapping the backup NN's in-memory image? Are we still talking about replay of edit logs (possibly hours before the FS comes back online), or something else? Is there talk of spraying the meta edits across primary and backup NNs so the backups can come online quicker? Thanks.

          dhruba borthakur added a comment -

          hi Stack, as you rightly pointed out, there are two issues at play here.

          Case 1: where do we store the namenode transaction log?
          Case 2: how does the active namenode transfer control to a standby namenode, and how long does this failover process take?

          In my view, these two issues are not tied up with one another. One can use NFS, DRBD, or even newer things like ZooKeeper/BookKeeper for storing the transaction logs in Case 1.

          For Case 2, we still have to work out the details. Sanjay's proposal seems to say that we should attempt to address Case 2 first. A design for Case 2 should work irrespective of the type of storage we choose in Case 1. In fact, getting the right design for Case 2 is the more challenging part of implementing a fast failover.


            People

            • Assignee: Unassigned
            • Reporter: Sanjay Radia
            • Votes: 1
            • Watchers: 72
