Hadoop HDFS

Proposal to store edits and checkpoints inside HDFS itself for namenode

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

      Would have liked to make this a "brainstorming" JIRA but couldn't find the option for some reason.

      I have talked to quite a few people about this proposal at Facebook internally (HDFS folks like Hairong and Dhruba, as well as HBase folks interested in this feature), and wanted to broaden the audience.

      At the core of the HA feature, we need 2 things:
      A. the secondary NN (or avatar stand-by or whatever we call it) needs to read all the fsedits and fsimage data written by the primary NN
      B. Once the stand-by has taken over, the old NN should not be allowed to make any edits

      The basic idea is as follows (there are some variants; we can home in on the details if we like the general approach):

      1. The write path for fsedits and fsimage:

      1.1 The NN uses a DFS client to write fsedits and fsimage. These will be regular HDFS files written using the write pipeline.
      1.2 Let us say the fsimage and fsedits files are written to a well-known location in the local HDFS itself (say /.META or some such location)
      1.3 The file creates and block additions under this path are not themselves recorded in fsimage or fsedits. The locations of the blocks for files in this location are known to all namenodes - primary and standby - somehow (some possibilities here: write these block ids to ZK, use reserved block ids, or write some metadata into the blocks themselves and store the blocks in a well-known location on all the datanodes).
      1.4 If the replication factor on the write pipeline decreases, we close the block immediately and allow the NN to re-replicate to bring the replication factor back up. We continue writing to a new block. (A sketch of this write path follows.)
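
      To make 1.1 concrete, here is a rough sketch of the NN side (assuming the /.META location from 1.2; the class, path, and method names are illustrative, not existing HDFS code):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Hypothetical helper: the NN writing an edits segment as a plain HDFS file.
      public class EditsSegmentWriter {
        private final FSDataOutputStream out;

        public EditsSegmentWriter(Configuration conf) throws Exception {
          FileSystem fs = FileSystem.get(conf);
          // Well-known location from 1.2; the exact path is up for discussion.
          out = fs.create(new Path("/.META/edits_inprogress"), false /* overwrite */);
        }

        // Append one serialized transaction and push it through the write
        // pipeline before acking the client (the synchronous case).
        public void logSync(byte[] serializedTxn) throws Exception {
          out.write(serializedTxn);
          out.hflush(); // flush to all datanodes in the pipeline
        }

        public void close() throws Exception {
          out.close();
        }
      }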

      2. The read path on a NN failure
      2.1 Since the new NN "knows" the location of the blocks for the fsedits and fsimage (again the same possibilities as mentioned above), there is nothing to do to determine this
      2.2 It can read the files it needs using the HDFS client itself (sketched below).
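
      A correspondingly small sketch of 2.2, again with an illustrative path - the standby just streams the file back through the ordinary client:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Hypothetical: the standby NN replaying edits it reads back out of HDFS.
      public class EditsReader {
        public static void replay(Configuration conf) throws Exception {
          FileSystem fs = FileSystem.get(conf);
          try (FSDataInputStream in = fs.open(new Path("/.META/edits_inprogress"))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) {
              // hand the n bytes to the edit-log loader for replay here
            }
          }
        }
      }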

      3. Fencing - if a NN is unresponsive and a new NN takes over, the old NN should not be allowed to perform any action
      3.1 Use HDFS lease recovery for the fsedits and fsimage files - the new NN will close all of these files being written to by the old NN (and hence all the blocks)
      3.2 The new NN (avatar NN) will write its address into ZK to let everyone know it's the master
      3.3 The new NN now gets the lease for these files and starts writing into the fsedits and fsimage
      3.4 The old NN cannot write into the file, as the block it was writing to was closed and it does not have the lease. If it needs to re-open these files, it needs to check ZK to see whether it is indeed the current master; if not, it should exit. (See the fencing sketch below.)
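
      As a rough sketch of 3.1-3.3 (the znode and file paths are illustrative, and this is not existing NN code), the new NN could do something like:

      import java.nio.charset.StandardCharsets;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.hdfs.DistributedFileSystem;
      import org.apache.zookeeper.CreateMode;
      import org.apache.zookeeper.ZooDefs;
      import org.apache.zookeeper.ZooKeeper;

      // Hypothetical fail-over fencing: recover the lease on the in-progress
      // edits file (3.1), then record this NN as master in ZK (3.2).
      public class FailoverFencing {
        public static void fenceAndTakeOver(Configuration conf, ZooKeeper zk,
                                            String myAddress) throws Exception {
          DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
          Path edits = new Path("/.META/edits_inprogress");

          // 3.1: force-close the file the old NN was writing; once this
          // completes, the old writer's pipeline is dead and its last block
          // is finalized.
          while (!dfs.recoverLease(edits)) {
            Thread.sleep(1000); // lease recovery is asynchronous; poll until done
          }

          // 3.2: advertise this NN as the current master.
          zk.create("/namenode/master", myAddress.getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

          // 3.3: this NN now holds the lease and can open its own segment and
          // start writing edits.
        }
      }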

      4. Misc considerations:
      4.1 If needed, we can specify favored nodes to place the blocks for this data on a specific set of nodes (say we want to use a different set of RAIDed nodes, etc.).
      4.2 Since we won't record the entries for /.META in fsedits and fsimage, a "hadoop dfs -ls /" command won't show the files. This is probably OK, and can be fixed if not.
      4.3 If we have 256 MB block sizes, then a 20 GB fsimage file would need 80 block ids - the NN would need only these 80 block ids to read all the fsimage data. The fsedits data is even smaller. This is very tractable using a variety of techniques (the possibilities mentioned above).

      The advantage is that we are re-using the existing HDFS client (with some enhancements, of course), and the solution is self-sufficient on top of the existing HDFS. Also, the operational complexity is greatly reduced.

      Thoughts?

        Activity

        Karthik Ranganathan added a comment -

        @Sanjay - done, removed the HA from the title.

        @Hari - a little confused about the write latency on sync versus async. By definition, I think we can't go the async route, right - because the NN has to complete a transaction (which involves writing to fsedits) before ack-ing to the client.

        Comparing this pipeline solution with BookKeeper, I am trying to understand how the BK-based solution would do any better - because the NN would still have to wait for the BK system to ack that it has written to the logical fsedits file on 3 different machines.

        Another data point is that HBase today writes its Write-Ahead Log (similar in function to fsedits) to HDFS as a file, and the latencies do not seem too bad. The 2 things that HBase does are:

        • group commit a bunch of transactions instead of making one RPC per edit
        • once the replication pipeline drops below 3, it immediately closes the file so that the NN will re-replicate.

        Wondering if a similar approach would work for the HDFS case - the only perf-related enhancement we might need is parallel writes from the client. (A rough group-commit sketch follows.)
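
        For what it's worth, the group-commit idea above could look roughly like this (purely illustrative, not HBase or NN code): callers hand their edits to a queue and a single syncer thread amortizes one hflush() over the whole batch.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.LinkedBlockingQueue;
        import org.apache.hadoop.fs.FSDataOutputStream;

        // Hypothetical group-commit wrapper around an HDFS output stream.
        public class GroupCommitLog {
          private final FSDataOutputStream out;
          private final LinkedBlockingQueue<byte[]> pending =
              new LinkedBlockingQueue<byte[]>();

          public GroupCommitLog(FSDataOutputStream out) {
            this.out = out;
            Thread syncer = new Thread(new Runnable() {
              public void run() { syncLoop(); }
            }, "edit-log-syncer");
            syncer.setDaemon(true);
            syncer.start();
          }

          // Called once per transaction. A real implementation would also block
          // the caller until the batch containing its edit has been flushed.
          public void append(byte[] edit) throws InterruptedException {
            pending.put(edit);
          }

          private void syncLoop() {
            try {
              List<byte[]> batch = new ArrayList<byte[]>();
              while (true) {
                batch.add(pending.take()); // wait for at least one edit
                pending.drainTo(batch);    // pick up everything else queued
                for (byte[] edit : batch) {
                  out.write(edit);
                }
                out.hflush();              // one pipeline round-trip per batch
                batch.clear();
              }
            } catch (Exception e) {
              throw new RuntimeException(e);
            }
          }
        }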

        Hari Mankude added a comment -

        Karthik,

        One of the things that is not explicitly mentioned is whether edit log writes/checkpoints are written synchronously or asynchronously to HDFS. A synchronous write of edit log entries might have write latency issues, as mentioned by Sanjay.

        Asynchronous writes are also acceptable, and HDFS write latency issues are not very critical with async writes. The disadvantage of asynchronously writing the edit log is that the edit log in HDFS will lag the actual log and could result in loss of data if the HDFS-based image has to be used.

        However, having more than one copy of the fsimage/edit logs in HDFS improves the recoverability of HDFS immensely (without total loss of data) in case of catastrophic NN failures. In fact, storing multiple checkpoints with the associated edit logs will also help in rollback situations if there is metadata corruption at the NN.

        Sanjay Radia added a comment -

        Putting the Image in HDFS should be a lot simpler and does not require addressing the latency issue. The checkpointer could write the image directly to HDFS; this is feasible due to the image/edits cleanup (HDFS-1073). There is a bootstrapping issue on NN startup but that should be solvable.
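
        A tiny sketch of that idea (placeholder paths, not an existing checkpointer API): after producing a checkpoint locally, upload the image file into HDFS itself.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Hypothetical: the checkpointer copies a freshly written fsimage into HDFS.
        public class CheckpointUploader {
          public static void upload(Configuration conf, String localImage) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            Path src = new Path(localImage);            // e.g. <dfs.name.dir>/current/fsimage
            Path dst = new Path("/.META/fsimage.ckpt"); // well-known location inside HDFS
            fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, src, dst);
          }
        }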

        Sanjay Radia added a comment -

        Makes a lot of sense; HA is just one of its benefits (as Todd states); I suggest we take HA out of the JIRA title but record it as one of the benefits. Several folks have talked about this in the past - the main issues were write latency and keeping the metadata in an independent storage system.

        Todd Lipcon added a comment -

        Regarding the stability of the write pipeline, the only concern is with concurrent failure + write pipeline recovery. This is one of the areas where HA gets tricky, and we'd need to make sure it's bullet-proof if we intend to store edit logs there. Since the edit log fencing has to work perfectly during a failover, it's a bit of uncharted territory.

        Karthik Ranganathan added a comment -

        @Andrew - just trying to understand a couple of points in more detail.

        << HDFS is stable, but not for this use case, yet, and we'd need a lot of work to get it there. >>

        Are you talking about the fact that there would be bugs in HDFS as-is that need to be fixed, or about the code complexity of implementing this approach? If the former, can you elaborate? If the latter, then we can go ahead with evaluating this approach and, if we like it, then do an evaluation of the code complexity and make a decision depending on what we find.

        << And we still end up with a stack that, like NFS, is a gazillion lines of code, but handles a lot of different stuff. Most of which we don't need for the image/edits application. >>

        I was thinking the opposite - that the HDFS write pipeline and recovery logic is battle-tested (in many respects - architecture, failure modes, code, actual usage) at such a large scale that it makes it rather attractive for such a critical piece of data, as opposed to writing something new (however simple) and having to prove it works from scratch?

        Karthik Ranganathan added a comment -

        @Dhruba - totally, will definitely scope out the changes to HDFS, but deferring that till we have a bit more discussion - to see if we have other opinions about this approach in general.

        Eli Collins added a comment -

        There are also non-shared-storage approaches (e.g. HDFS-2064). We're pushing on using NFS in the near term in the HDFS-1623 sub-tasks. Relative to the cost of a cluster, the capex of a suitable filer (e.g. a FAS2040) isn't actually that high. We could also do things to lower this requirement (e.g. have the NN better handle and recover from flaky NFS mounts). In the medium to long term, I think having HDFS expose its block layer so it can bootstrap its namespace storage (and expose it to other clients like HBase) is the way to go. Not having the operational complexity of a separate system (a filer, BK, etc.) is a big win.

        Andrew Ryan added a comment -

        Following on Todd's comment, it seems to me we now have proposals floating around to store the image/edits data in one of three different locations: shared storage (NFS), BookKeeper, and HDFS.

        NFS is stable and mature, but good, reliable, redundant NFS hardware is expensive and proprietary. The NFS filesystem has a lot of features which we don't need for this use case. There are other shared storage filesystems out there too, but as far as I know none of them are in wide use for storing image/edits data, so I'm only mentioning NFS.

        HDFS is stable, but not for this use case, yet, and we'd need a lot of work to get it there. And we still end up with a stack that, like NFS, is a gazillion lines of code, but handles a lot of different stuff. Most of which we don't need for the image/edits application.

        BookKeeper doesn't exist yet, but I like that it's special-purpose, written just to provide a minimal set of features to enable the image/edits scenario. So hopefully it won't be a gazillion lines of code when it's done. But it will require a lot of time to stabilize and prove itself.

        NFS as shared storage for image/edits works today. It's the basis of our HA namenode strategy at Facebook for the foreseeable future. It's hard to debate the merits of HDFS vs. BK for image/edits storage, since neither exists; we're comparing unicorns to leprechauns. But my operational instincts and experience running Hadoop tell me that either BK or a separate HDFS would be best. But I'd need to better understand what the operational characteristics of each system would be, and these are not well-defined yet.

        I'm looking forward to more discussion and hearing various viewpoints.

        dhruba borthakur added a comment -

        Can you please detail the high-level changes needed to HDFS? At the very least, HDFS needs to handle these meta blocks differently from other blocks. The NN needs to persist locations for these meta blocks, perhaps?

        Karthik Ranganathan added a comment -

        Yes, totally - I was not sure what to put in the title; will update the title.

        Todd Lipcon added a comment -

        Makes sense to me, but it seems like this proposal is really about "store edits and checkpoints inside HDFS itself" – since the rest of the HA proposal remains the same as the shared storage approach outlined in HDFS-1623. It's just that the shared storage is made of HDFS instead of a NAS?


          People

          • Assignee: Unassigned
          • Reporter: Karthik Ranganathan
          • Votes: 0
          • Watchers: 29
