Hadoop HDFS / HDFS-1051

Umbrella Jira for Scaling the HDFS Name Service

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.22.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The HDFS Name service currently uses a single Namenode which limits its scalability. This is a master jira to track sub-jiras to address this problem.


          Activity

          Sanjay Radia created issue -
          Sanjay Radia added a comment -

          The HDFS Name service currently uses a single Namenode. The Namenode maintains the entire file system metadata in memory. This includes:

          • the list of files and directories (i.e., their inodes)
          • the block map, which maps block ids to their locations

          The size of the metadata is limited by the physical memory available on the node. This bounds:

          • the number of files that can be created, and
          • the number of blocks, and hence the total storage that can be addressed.

          There are various ways to fix this. Some of these have been identified in
          http://wiki.apache.org/hadoop/HdfsFutures.
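
          To make the memory coupling concrete, here is a minimal sketch (in Java) of the two structures described above. The class and field names are hypothetical simplifications for illustration, not the actual Namenode code:

              import java.util.HashMap;
              import java.util.List;
              import java.util.Map;

              // Hypothetical, simplified view of the Namenode's in-memory metadata.
              // Every file, directory, and block consumes heap, so the addressable
              // namespace is bounded by the RAM of this single machine.
              class SimplifiedNamespace {
                  // Inode table: one entry per file or directory.
                  private final Map<String, Inode> inodes = new HashMap<String, Inode>();
                  // Block map: block id -> datanodes currently holding a replica.
                  private final Map<Long, List<String>> blockMap = new HashMap<Long, List<String>>();

                  static class Inode {
                      String path;
                      long[] blockIds; // blocks making up this file
                  }
              }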

          Suresh Srinivas made changes -
          Link This issue incorporates HDFS-1052 [ HDFS-1052 ]
          ryan rawson added a comment -

          Perhaps we could think of using Zab from ZooKeeper to keep a cluster of NewNameNodes in sync all the time? Instead of doing failure detection, failover, and recovery, we could keep 2N+1 nodes always up to date. No recovery would be necessary during node failover, and rolling restarts and machine moves would be doable (just like in ZK). This could reduce the deployment complexity over some of the other options.
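
          A minimal sketch of how that synchronization might look, assuming each edit-log entry is published as a sequential znode through the stock ZooKeeper client; this only illustrates the idea (Zab provides the total ordering underneath), and the znode paths and payload encoding are made up:

              import org.apache.zookeeper.CreateMode;
              import org.apache.zookeeper.KeeperException;
              import org.apache.zookeeper.ZooDefs;
              import org.apache.zookeeper.ZooKeeper;

              // Sketch: publish each namespace edit as a sequential znode so that
              // every NameNode replica can replay the same totally ordered log.
              class ZkEditLog {
                  private final ZooKeeper zk;

                  ZkEditLog(ZooKeeper zk) {
                      this.zk = zk;
                  }

                  // Append one serialized edit; the ensemble totally orders it.
                  void logEdit(byte[] serializedEdit)
                          throws KeeperException, InterruptedException {
                      zk.create("/namenode/edits/edit-", serializedEdit,
                                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                CreateMode.PERSISTENT_SEQUENTIAL);
                  }
              }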

          For metadata scalability in RAM, a possibility would be to leverage flash disk in some capacity. Swap, mmap, explicit local files in flash would be reasonably fast and could extend the serviceable life of the "1 machine holds all metadata" architecture concept.
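
          And a sketch of the mmap variant, keeping the block map in a file on a flash device instead of on the Java heap; the file path and fixed-width record layout are hypothetical:

              import java.io.IOException;
              import java.io.RandomAccessFile;
              import java.nio.MappedByteBuffer;
              import java.nio.channels.FileChannel;

              // Sketch: memory-map a block-map file on flash so the metadata
              // lives off the Java heap; the OS pages hot regions into RAM.
              class MmapBlockMap {
                  private final MappedByteBuffer buf;

                  MmapBlockMap(String path, long sizeBytes) throws IOException {
                      RandomAccessFile f = new RandomAccessFile(path, "rw");
                      buf = f.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
                  }

                  // Fixed-width records: an 8-byte block id slot starts each record.
                  long readBlockId(int recordIndex, int recordSize) {
                      return buf.getLong(recordIndex * recordSize);
                  }
              }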

          Jeff Hammerbacher added a comment -

          Fix Version/s: 0.22.0. I like the ambition! Purely out of curiosity, does Yahoo! see this feature as a blocker for the 0.22 release? It would be nice to do a release before then, for certain.

          ryan rawson added a comment -

          One other note - most people have problems with the availability of the NameNode, not the scalability of the cluster (i.e., medium data rather than big data). I would argue that availability of the NameNode is the highest priority and is what is holding HDFS back from widespread adoption.

          Sanjay Radia made changes -
          Fix Version/s 0.22.0 [ 12314241 ]
          Sanjay Radia added a comment -

          Oops - I didn't intend to target release 0.22; this master jira will remain open for a very long time.

          Kevin Weil added a comment -

          I agree with Ryan. There are probably not more than a handful of Hadoop clusters in the world getting close to maximizing NN memory (and machines with 144 GB memory or more aren't too hard to find these days). Twitter's biggest challenge with bringing Hadoop closer to the runtime path is its lack of HA guarantees.

          Sanjay Radia added a comment -
          • Filing of this Jira does not exclude work on HA. These are independent issues.
          • BTW, there has been progress on HA:
            • We have made a lot of progress in restarting an HDFS cluster - down from several hours to less than 30 minutes for a 3K-node cluster over the last 2 years. Reducing restart time is important for HA: a cold or warm failover performs all or part of a restart.
            • Yahoo, along with reducing restart time, took a major step towards HA by adding the backup namenode, which synchronously gets the edit logs (see the sketch after this list). This work needs to be extended to do an actual failover. For example, Dhruba@Facebook has taken an additional step in exploring manual failover.
          • Hadoop is an open source community project. Each contributor sets his/her own priorities. Please help.
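
          A minimal sketch of the synchronous edit-log shipping idea mentioned above: an edit is acknowledged only after it reaches both the local log and the backup node. The stream setup and wire format here are hypothetical, not the actual BackupNode protocol:

              import java.io.DataOutputStream;
              import java.io.IOException;

              // Sketch: fan each edit out to the local log and a backup NameNode
              // before acknowledging, so the backup never lags the primary.
              class SyncEditShipper {
                  private final DataOutputStream localLog;   // local edits file
                  private final DataOutputStream backupLink; // socket to the backup node

                  SyncEditShipper(DataOutputStream localLog, DataOutputStream backupLink) {
                      this.localLog = localLog;
                      this.backupLink = backupLink;
                  }

                  synchronized void logEdit(byte[] edit) throws IOException {
                      localLog.write(edit);
                      backupLink.write(edit);
                      // Only after both writes are flushed is the edit durable on two
                      // nodes; the caller may then acknowledge the client.
                      localLog.flush();
                      backupLink.flush();
                  }
              }
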
          Sanjay Radia added a comment -

          > NN memory ... machines with 144 GB memory or more aren't too hard to find these days ....
          Unfortunately it is not that simple.
          We currently run 50 GB heaps at Yahoo and have issues in terms of GC pauses and also startup time.
          (And startup time is important for HA, as mentioned in my previous comment.)
          Not sure how well the Java VMs scale to 144 GB.
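
          For context, a large-heap Namenode of that era would typically be launched with JVM options along these lines (in hadoop-env.sh); the values are illustrative, not a recommendation from this thread:

              # hadoop-env.sh - illustrative large-heap settings, not from this thread.
              # CMS was the usual low-pause collector choice for big Namenode heaps.
              export HADOOP_NAMENODE_OPTS="-Xms50g -Xmx50g \
                -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
                -XX:+CMSParallelRemarkEnabled \
                -verbose:gc -XX:+PrintGCDetails"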

          Sanjay Radia added a comment -

          I have created HDFS-1064 for discussing improvements to the availability of NN.

          Jeff Hammerbacher made changes -
          Link This issue relates to HDFS-1068 [ HDFS-1068 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HADOOP-6714 [ HADOOP-6714 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HADOOP-6713 [ HADOOP-6713 ]
          Tom White made changes -
          Link This issue relates to HADOOP-6714 [ HADOOP-6714 ]
          Jeff Hammerbacher added a comment -

          Another piece of research germane to this JIRA: "Haceph: Scalable Metadata Management for Hadoop using Ceph" from UCSC (http://www.soe.ucsc.edu/~carlosm/Papers/eestolan-nsdi10-abstract.pdf).

          Jeff Hammerbacher made changes -
          Link This issue relates to HDFS-1162 [ HDFS-1162 ]
          Jeff Hammerbacher added a comment -

          Another path to scalability: BAH's Aaron Cordova has a prototype that uses HBase to store the NN metadata, at http://code.google.com/p/hdfs-dnn/.
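
          The core of that approach is that namespace entries become rows in an HBase table. A minimal sketch using the standard HBase client API (the table name, column names, and row layout are invented for illustration; they are not taken from the prototype):

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.hbase.HBaseConfiguration;
              import org.apache.hadoop.hbase.TableName;
              import org.apache.hadoop.hbase.client.Connection;
              import org.apache.hadoop.hbase.client.ConnectionFactory;
              import org.apache.hadoop.hbase.client.Put;
              import org.apache.hadoop.hbase.client.Table;
              import org.apache.hadoop.hbase.util.Bytes;

              // Sketch: one namespace entry per HBase row, keyed by path, so
              // metadata capacity scales with the HBase cluster, not one heap.
              class HBaseNamespaceStore {
                  public static void main(String[] args) throws Exception {
                      Configuration conf = HBaseConfiguration.create();
                      try (Connection conn = ConnectionFactory.createConnection(conf);
                           Table table = conn.getTable(TableName.valueOf("nn_metadata"))) {
                          Put put = new Put(Bytes.toBytes("/user/alice/data.txt"));
                          put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("blocks"),
                                        Bytes.toBytes("blk_1001,blk_1002"));
                          table.put(put);
                      }
                  }
              }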

          Jeff Hammerbacher made changes -
          Link This issue relates to HDFS-1499 [ HDFS-1499 ]
          Allen Wittenauer added a comment -

          I suspect the work here is done, but it would be good for someone to go through this and the sub-jiras to make sure....


            People

            • Assignee:
              Sanjay Radia
              Reporter:
              Sanjay Radia
            • Votes:
              1
              Watchers:
              60
