Hadoop HDFS / HDFS-1961

New architectural documentation created

    Details

      Description

      This material provides an overview of the HDFS architecture and is intended for contributors. The goal of this document is to provide a guide to the overall structure of the HDFS code so that contributors can more effectively understand how changes that they are considering can be made, and the consequences of those changes. The assumption is that the reader has a basic understanding of HDFS, its purpose, and how it fits into the Hadoop project suite.

      An HTML version of the architectural documentation can be found at: http://kazman.shidler.hawaii.edu/ArchDoc.html

      All comments and suggestions for improvements are appreciated.

      1. HDFS ArchDoc.Jira.docx
        136 kB
        Rick Kazman
      2. HDFS-1961_ArchDoc.comments.docx
        145 kB
        Matt Foley
      3. HDFS-1961_ArchDoc.comments.RK.052011.docx
        145 kB
        Rick Kazman

        Activity

        Rick Kazman added a comment -

        This is a Word version of the architecture documentation. The HTML version can be found at:
        http://kazman.shidler.hawaii.edu/ArchDoc.html

        Rick Kazman added a comment -

        We expect to be making periodic updates to this document. Our first update task is to add sequence diagrams to section 6.

        Matt Foley added a comment -

        Good start. A few suggestions:

        Section 4.4: Suggest start with:
        "All communication between Namenode and Datanode is initiated by the Datanode, and responded to by the Namenode. The Namenode never initiates communication to the Datanode, although Namenode responses may include commands to the Datanode that cause it to send further communications."

        4.4.2 "DataNode Command – send heartbeat."
        Suggest change to "DataNode sends Heartbeat."

        4.4.3 "DataNodeCommand – block report."
        Suggest change to "DataNode sends BlockReport."

        4.4.4 "BlockReceived."
        Suggest change to "DataNode notifies BlockReceived."

        Section 5.2:
        In the list of NN "threads", calling the first one "HeartBeat" is a little confusing. Please consider calling it something like "Datanode Health Management", instead. In the code it is called "HeartbeatMonitor", but its job is neither sending nor receiving heartbeats, but rather to periodically check to make sure that every Datanode has sent a heartbeat at least once in the last 10 minutes (or as configured).
        Should probably also mention the bundle of threads that provide the Namenode's RPC service, which receives and processes all 13 kinds of communication from Datanodes and Clients.
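
The liveness check described for Section 5.2 can be sketched roughly as follows. This is an illustrative model only, written in Python for brevity rather than taken from the HDFS code base: `HeartbeatRegistry`, `record_heartbeat`, and `find_dead` are hypothetical names, and the 10-minute window is simply the figure quoted above (the real value is configurable).

```python
from dataclasses import dataclass, field

# Assumption: expiry window of 10 minutes, as quoted in the comment above.
HEARTBEAT_EXPIRY_SECONDS = 10 * 60


@dataclass
class HeartbeatRegistry:
    """Hypothetical record of when each Datanode was last heard from."""
    expiry_seconds: float = HEARTBEAT_EXPIRY_SECONDS
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, datanode_id: str, now: float) -> None:
        # Invoked on the receiving side (an RPC handler), not by the monitor.
        self.last_seen[datanode_id] = now

    def find_dead(self, now: float) -> list:
        # The monitor's periodic job: it neither sends nor receives
        # heartbeats, it only inspects the recorded timestamps.
        return sorted(
            dn for dn, seen in self.last_seen.items()
            if now - seen > self.expiry_seconds
        )


registry = HeartbeatRegistry()
registry.record_heartbeat("dn1", now=0.0)
registry.record_heartbeat("dn2", now=500.0)
print(registry.find_dead(now=601.0))
```

The point of the separation is the one made above: receiving heartbeats belongs to the RPC service, while the monitor only decides which Datanodes have gone silent for too long.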

        Section 5.3:
        "This [blockReceived notification] may prevent NameNode temporarily from asking for a full block report since the receipt of a blockReceived() message indicates that the DataNode is still alive."
        That sentence isn't correct, since it is relatively unusual for the NN to ask the DN for a block report. (It only happens when recovering from gross errors.)

        Instead, suggest including in this section a brief discussion of the fact that the DN sends a heartbeat to the NN every 3 seconds (or as configured), which allows the NN a chance to respond with commands such as

        • "delete replica" if a block has become over-replicated, or
        • "copy replica to this other DN" if a block needs further replication.

        And the DN initiates a BlockReport to the NN every hour (or as configured), which prevents any divergence in the NN and DN belief about which replicas are held by each datanode.
        And yes, it also sends an immediate blockReceived notification whenever it receives a new block, whether from a Client (file create/append), or from another Datanode (block replication).

        "A blockReport() is also issued periodically as a portion of the HeartBeat."
        Not exactly. The DN's "heartbeat thread" takes care of sending both, at the appropriate time intervals, but they are separate RPCs to the NN.
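
The communication model described in this comment might be sketched as below. It is a toy simulation under stated assumptions, not HDFS code: `ToyNamenode`, `ToyDatanode`, `tick`, and the command tuples are all hypothetical, and the 3-second and 1-hour intervals are just the configurable defaults quoted above. It aims to show three things: the Namenode only ever responds, commands ride back on the heartbeat response, and heartbeat and block report are separate calls driven by separate timers.

```python
HEARTBEAT_INTERVAL = 3            # seconds (assumed default, configurable)
BLOCK_REPORT_INTERVAL = 60 * 60   # seconds (assumed default, configurable)


class ToyNamenode:
    """Only ever responds to calls; it never calls a Datanode."""

    def __init__(self):
        self.pending_commands = {}  # datanode id -> list of (command, block id)
        self.replicas = {}          # datanode id -> set of block ids

    def heartbeat(self, dn_id):
        # Any queued commands piggyback on the heartbeat response.
        return self.pending_commands.pop(dn_id, [])

    def block_report(self, dn_id, block_ids):
        # A full report replaces the Namenode's view of this Datanode.
        self.replicas[dn_id] = set(block_ids)

    def block_received(self, dn_id, block_id):
        self.replicas.setdefault(dn_id, set()).add(block_id)


class ToyDatanode:
    def __init__(self, dn_id, namenode):
        self.dn_id = dn_id
        self.namenode = namenode
        self.blocks = set()
        self.last_heartbeat = None
        self.last_block_report = None
        self.rpc_log = []

    def receive_block(self, block_id):
        # Immediate notification when a new block arrives.
        self.blocks.add(block_id)
        self.namenode.block_received(self.dn_id, block_id)
        self.rpc_log.append("blockReceived")

    def tick(self, now):
        """One pass of the 'heartbeat thread': two independent timers."""
        if (self.last_heartbeat is None
                or now - self.last_heartbeat >= HEARTBEAT_INTERVAL):
            self.last_heartbeat = now
            self.rpc_log.append("heartbeat")
            for command, block_id in self.namenode.heartbeat(self.dn_id):
                if command == "delete":
                    self.blocks.discard(block_id)
        if (self.last_block_report is None
                or now - self.last_block_report >= BLOCK_REPORT_INTERVAL):
            self.last_block_report = now
            self.rpc_log.append("blockReport")
            self.namenode.block_report(self.dn_id, self.blocks)


nn = ToyNamenode()
dn = ToyDatanode("dn1", nn)
dn.receive_block("b1")
dn.tick(now=0)
print(dn.rpc_log)
```

In this simplified model the Namenode's view of a Datanode can lag behind after a "delete" command until the next full block report arrives, which is a crude stand-in for the divergence the periodic report is meant to correct.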

        Rick Kazman added a comment -

        Matt,

        Thanks for the comments. I have implemented almost all of your suggestions (although you might want to review to ensure that my wording captures your intention consistently).

        The only one I haven't implemented yet is: "Should probably also mention the bundle of threads that provide the Namenode's RPC service, which receives and processes all 13 kinds of communication from Datanodes and Clients." Can you suggest some more precise wording for this? I don't want to misrepresent it. Should this be a fourth enumerated entry under:
        1. DataNode Health Management
        2. Replica Management
        3. Lease Management
        ????

        Matt Foley added a comment -

        Okay. Yesterday I didn't notice that section 5.2.7 incorrectly assumes a NN-centric model of communication with the DN. Remember, the NN never initiates communication with the DN, it's always the other way around. I'm attaching a proposed change to 5.2.7, with new sections for the RPC Worker Threads.

        Please use the MS Word "Track Changes" tool to review the proposed changes. Thanks.

        Todd Lipcon added a comment -

        Any chance this could be moved over to a text-based format (e.g., LaTeX, Forrest, AsciiDoc)? We can't check in a .docx file.

        Rick Kazman added a comment -

        Todd,

        Right now I am maintaining a Word version and an HTML version (the HTML version is hosted at http://kazman.shidler.hawaii.edu/ArchDoc.html for the moment). The advantage of the Word doc is that it makes change tracking and commenting easy, and there are already quite a few comments and changes (proposed by Matt Foley). The advantage of HTML is that it makes publishing and tracking easy. As part of my research I need to track who is using the documentation, how, and how often.

        But having two versions of course creates a version control issue for me, since I need to manually keep the two versions consistent--not good. I am happy to try Forrest (and do whatever makes most sense for the project) but I haven't used it before. Suggestions?

        Todd Lipcon added a comment -

        Hi Rick. I'm glad to see someone producing the document, but curious what the intention is. Is this meant to be checked in to the codebase, and is the expectation that contributors will update it as the architecture changes? If so, the Word/HTML pair is problematic since it can't be version-controlled in Subversion, and many people can't edit it (I don't have a copy of Word).

        In terms of tracking who is using the docs, if it's checked in, then you won't be able to tell that either since most people will access it locally.

        Perhaps the best route here is for this to not become part of Hadoop, but just to live on your site? We could include a link to it from our documentation in some kind of "Other resources" page, for example.

        Rick Kazman added a comment -

        Todd,

        I'm a researcher. My interest is to understand what the impact of architectural documentation is on a large, distributed project--who uses it, does it change how people interact and communicate, does it improve product quality over time (since there is a shared representation of design intent), does it make it easier to recruit new contributors or have contributors "grow" into committers, etc.

        I am sincere in my intent to create the best possible architectural documentation but eventually, if it is to be useful, the HDFS community needs to "own" this.

        Currently, in addition to posting the Word version to Jira, I put a link to the HTML version of the architecture documentation at the bottom of the "Related Resources" section of the wiki at http://wiki.apache.org/hadoop/

        What do you recommend? Of course, we can do one thing in the short term and another in the long term, but in the end the community needs to maintain this document, not me.

        Matt Foley added a comment -

        Wouldn't this kind of document go in the Wiki rather than the code base? I assumed you were starting it this way to get it into passable shape before posting it on the Wiki.

        Is there a "beta" area of the Wiki, for docs that aren't yet ready for prime time?

        The Wiki provides history (like version control) and allows comment in the form of edits. Not ideal for tracking, but perhaps workable?

        Regarding format, the Word doc can of course generate (ugly) HTML.
        Some Wikis have a WYSIWYG editor, and some of them allow MS Word paste-in.
        I don't know if ours does; I've never tried it.

        Rick Kazman added a comment -

        Matt,

        I just uploaded a version of your commented Word document with my changes and comments. Please take a look, if you have a chance. I also moved all of these changes over to the HTML version at http://kazman.shidler.hawaii.edu/ArchDoc.html so, for now, everything is in synch.

        I think that your suggestion of using the Wiki going forward makes sense. It seems like the right place for documentation and, as you say, allows for some version control and commenting. I will look into this over the weekend.

        Rick Kazman added a comment -

        I have updated the HDFS 0.21 architectural documentation at http://kazman.shidler.hawaii.edu/ArchDoc.html to include a module view of the architecture. This representation depicts the HDFS codebase as 10 interconnected modules. It also highlights some unhealthy dependencies between modules that can limit system evolution.

        As always, comments appreciated!


          People

          • Assignee:
            Rick Kazman
            Reporter:
            Rick Kazman