Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.22.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

Description

This is the design for automatic hot HA for the HDFS NameNode. It involves the use of HA software and a LoadReplicator, components external to Hadoop that substantially simplify the architecture by separating HA-specific from Hadoop-specific problems. Without the external components it provides a warm standby with manual failover.

Attachments

      1. failover-v-0.22.patch
        49 kB
        Konstantin Shvachko
      2. failover-v-0.22.patch
        24 kB
        Konstantin Shvachko
      3. failover-v-0.22.patch
        24 kB
        Konstantin Shvachko
      4. WarmHA-GoingHot.pdf
        4.45 MB
        Konstantin Shvachko

Activity

          Konstantin Shvachko added a comment -

          The design document. This is generally in line with the ideas laid out in HDFS-1623 and can be considered a part of it, although it is self-sufficient imo and will cover most practical cases.

          Robert Chansler added a comment -

          When I read K's paper, I found that it did generally fit the model discussed among ourselves and in HDFS-1623. I really would consider it a specialization of the model. I've re-read both K's and Sanjay+Suresh's papers fairly carefully to get a better sense of the differences.

          1. In K7.8 there is a discussion of whether the NN should more proactively look for missing replicas. I don't remember discussing this, but my first thought is that this is an instance of trying too hard at the margin. During what fraction of the system's lifetime would this help?

          2. K7.6 mentions turning off lease recovery, but also turning off replication checking, as in SS9.3.1.

          3. What is the scope of VIP solutions? This is the "single switch" question. A while ago, we got into trouble when VIP did not just work with HDFS. More recently we got into trouble when DNS resolution was cached, but when I asked Rajiv why VIP wasn't the answer, he said that they could not (in general) provide an alternate host with the NN's VIP. Is everybody confident that single-switch VIP works well enough for HA? (When I ask that question of Rajiv, he says yes.)

          4. We've been very anxious about the stale deletion request problem where a DN has a request from the old NN that has not been reported to the NN now in service. This is hinted at in SS9.1.2, but I don't think this is fully understood yet. SS goes further into the topic of "data node fencing." Sanjay and I disagree on the merits. I'd argue that DNs should just do as they are told, and not try to mediate sibling disputes among NNs.

          5. NN arbitration really is important. K hints somewhere (can't find it now) that the old NN must be stopped. SS are more emphatic. I'd say do STONITH. This becomes more important anywhere near the word "automatic."

          6. SS9.2 mentions "leader election". Is the world really symmetric? XXX¹ denied that symmetry was a good thing. Any specific proposal needs to address the question of how alike the first and second systems are, and whether the process runs backward.

          7. The Load Replicator in K6.2 is a new contribution to the discussion. This bears on issue #4, above.

          8. Where K really diverges from most discussion here is over the question of Backup name node versus spooling edits on secondary storage. I mostly understand the issues, but in a practical BN deployment, is there a remaining need for some shared storage?

          Why, yes, you could do differently, but a practical solution has:

          • VIP
          • LinuxHA
          • No Zookeeper
          • STONITH
          • Only transfer one way without administrator intervention

          The open argument in my mind is BN versus spool-to-disk. Oh, and if the LR really means DNs need not know that there are multiple servers, life is delightfully simpler.

          ¹ My memory is uncertain about the proper attribution here.

          Robert Chansler added a comment -
          Konstantin Shvachko added a comment -

          Rob, thanks for the design review. Some clarifications and comments.
          Yes, I consider this design a specialization of Sanjay's and Suresh's design, in the sense that it is a minimalistic (in terms of changes to the existing code) approach dedicated to one direction: building HA based on the StandbyNode.

          > 1. K7.8
          Accelerating block reports after failover is indeed an optimization. Good point: during normal operation both BlockMaps should be in sync, and the acceleration targets the case where the SBN has missed a lot of block reports, which can be monitored on the SBN web UI or via metrics.

          > 3. What is the scope of VIP solutions?
          Talking to different people, I came to the conclusion that failover within one rack is sufficient.

          • First of all, VIP is a good abstraction, and if the current implementation does not satisfy certain needs, the networking industry will find a way to innovate.
          • Second, the rack that runs the NN and SBN can be designed more reliably than regular (DataNode) racks: with two ToR switches, and with bonded interfaces (forgive me if I get the terminology wrong) both inside the rack and out, for fault tolerance.
          • Third, some disasters would require a magnitude 9.0 earthquake followed by a tsunami to happen. Should Hadoop be designed for that? Probably not. I just need to hit the 99.94% availability mark (see the arithmetic below).
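
          For scale (my arithmetic, not from the design doc), a 99.94% availability target works out to:

              (1 - 0.9994) × 365 × 24 ≈ 5.3 hours of allowed downtime per year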

          > 4. the stale deletion request problem
          I hoped I had covered it in 7.9, but I see now that this section needs more detail, and I missed the third important case: when setReplication() explicitly decreases the replication. I hope we can solve it by adding replica locations to logSetReplication(); a sketch of the idea follows. I'll update this section.
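
          To make that concrete, a minimal Java sketch of such a record (the type and its fields are hypothetical illustrations, not an actual HDFS edit-log opcode):

              /** When setReplication() decreases the target, the active NN records
               *  which replicas it chose to drop, so the standby can replay the
               *  identical choice instead of making its own, possibly conflicting one. */
              class SetReplicationOp {
                  final String path;             // file whose replication changes
                  final short newReplication;    // the decreased target
                  final String[] excessReplicas; // DataNode IDs picked for removal

                  SetReplicationOp(String path, short newReplication, String[] excessReplicas) {
                      this.path = path;
                      this.newReplication = newReplication;
                      this.excessReplicas = excessReplicas;
                  }
              }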

          > 6. "leader election". Is the world really symmetric?
          The NN and SBN are asymmetric, and this simplifies things a lot: I am the active NN if I have the nn.vip. Some other node can think it is active, but since it doesn't have the VIP, her ambitions don't matter, as nobody can talk to her. The asymmetric design eliminates leader election and client fencing. It's a good thing. A small sketch of the check is below.
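
          A minimal Java sketch of the "do I hold the VIP" check (the nn.vip hostname and the standalone framing are assumptions; in practice LinuxHA owns and moves the address):

              import java.net.InetAddress;
              import java.net.NetworkInterface;

              /** A node treats itself as the active NN iff the VIP is currently
               *  bound to one of its local interfaces. */
              public class VipCheck {
                  static boolean holdsVip(String vipHost) throws Exception {
                      InetAddress vip = InetAddress.getByName(vipHost);
                      // Non-null iff some local interface has this address configured.
                      return NetworkInterface.getByInetAddress(vip) != null;
                  }

                  public static void main(String[] args) throws Exception {
                      String vipHost = args.length > 0 ? args[0] : "nn.vip"; // hypothetical name
                      System.out.println(holdsVip(vipHost) ? "active" : "standby");
                  }
              }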

          > 8. spooling edits on secondary storage.
          By secondary storage I think you mean a filer or BookKeeper. A filer is enterprise storage; we are building a distributed storage system based on commodity components, and adding a dependency on enterprise storage seems counterintuitive to me. Any shared-storage solution will require solving synchronization problems, see my note about addBlock() in 7.5. If blockReceived() arrives at the SBN before addBlock(), the replica is lost for another hour; addBlock() must be synchronous in order to avoid such a race condition. A toy model of the race is sketched below.
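
          A toy Java model of that ordering constraint (all names are illustrative, not HDFS APIs):

              import java.util.Map;
              import java.util.Set;
              import java.util.concurrent.ConcurrentHashMap;

              /** If blockReceived() outruns addBlock(), the standby drops the report
               *  and only recovers the replica at the next full block report. */
              class StandbyModel {
                  private final Set<String> knownBlocks = ConcurrentHashMap.newKeySet();
                  private final Map<String, String> replicaLocations = new ConcurrentHashMap<>();

                  /** Replayed from the edit stream; must complete before the client
                   *  may start the write pipeline to the DataNodes. */
                  void addBlock(String blockId) {
                      knownBlocks.add(blockId);
                  }

                  /** Sent by a DataNode once a replica lands on disk. */
                  void blockReceived(String blockId, String dataNodeId) {
                      if (!knownBlocks.contains(blockId)) {
                          return; // out of order: replica invisible until the next block report
                      }
                      replicaLocations.put(blockId, dataNodeId);
                  }
              }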

          > in a practical BN deployment, is there a remaining need for some shared storage?
          I don't see any.

          M. C. Srivas added a comment -
          M. C. Srivas added a comment -

          @konstantin: I really liked the simplicity of this design. Nice work!

          This might be my own lack of knowledge: is the SBN in charge of the fsimage file? If it starts much later than the NN, does it fetch both the fsimage file and the edits-log from the NN?

          Dennis Oberhoff added a comment -
          Dennis Oberhoff added a comment -

          No, not yet. Currently you have to start the node as a BackupNode, which imports the fsimage to the SBN. After that you have to kill the node and restart it as a standby. Right now -importCheckpoint doesn't seem to do the work. It would be nice if this procedure could be implemented in standby mode.

          Konstantin Shvachko added a comment -
          Konstantin Shvachko added a comment -

          > If it starts much later than the NN, does it fetch both the fsimage file and the edits-log from the NN?

          Yes it does. The patch that I attached has an SBN, which is an evolution of the BN. In the current implementation, the BN determines during registration with the NN whether it has an old image and, if so, fetches the new one (along with the edits). The SBN will do the same thing, so, contrary to the procedure Dennis describes, there is no need to start the node as a BN and then restart it as an SBN. A toy sketch of this catch-up is below.
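
          To illustrate the catch-up step, a toy Java sketch (all types and method names are illustrative, not the actual BackupNode code):

              /** At registration the standby compares its checkpoint with the active
               *  NN's and, if stale, fetches the current image plus the newer edits
               *  before it starts tailing the journal. */
              class CatchUpSketch {
                  interface ActiveNode {
                      long latestImageTxId();
                      byte[] downloadImage();
                      byte[] downloadEditsSince(long txId);
                  }

                  long localImageTxId;

                  void register(ActiveNode active) {
                      if (localImageTxId < active.latestImageTxId()) {
                          byte[] image = active.downloadImage();
                          byte[] edits = active.downloadEditsSince(localImageTxId);
                          load(image, edits); // load the image, then replay the edits
                          localImageTxId = active.latestImageTxId();
                      }
                      // From here on, stream edits from the active NN in real time.
                  }

                  void load(byte[] image, byte[] edits) { /* apply image, replay edits */ }
              }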

          Dennis, -importCheckpoint has a different purpose: if you have a separate image directory, which is not the name or edits directory for your NN, you can import data from that checkpoint directory, and the NN will save the imported state in its name directories. The NN's name directories should be empty before you run -importCheckpoint.

          Konstantin Shvachko added a comment -
          Konstantin Shvachko added a comment -

          Updated patch for current branch state.

          Suresh Srinivas added a comment -
          Suresh Srinivas added a comment -

          I am not sure why this is marked Blocker. I am changing the priority to Major.

          Matt Foley added a comment -
          Matt Foley added a comment -

          HDFS-978 was resolved as a dup in favor of HDFS-1108.

          Eli Collins added a comment -
          Eli Collins added a comment -

          Hey Konst,
          Are you still working on this or should we close it out?

          Konstantin Shvachko added a comment -
          Konstantin Shvachko added a comment -

          I think this jira still makes sense. I'll close it when I see it doesn't.

          Konstantin Shvachko added a comment -
          Konstantin Shvachko added a comment -

          Latest patch for branch 0.22. The previous one got stale. In case anybody wants to try it.

          Konstantin Shvachko added a comment -
          Konstantin Shvachko added a comment -

          Closing this in favor of HDFS-6469.


People

    • Assignee: Konstantin Shvachko
    • Reporter: Konstantin Shvachko
    • Votes: 2
    • Watchers: 29
