Details

    • Hadoop Flags:
      Reviewed

      Description

      In a high availability setup with an active and a standby namenode, there is a possibility of two namenodes sending commands to the datanode. The datanode must honor commands only from the active namenode and reject commands from the standby, to prevent corruption. This invariant must hold during failover and in other states such as split brain. This JIRA covers the issues involved, the design of the solution, and its implementation.

      1. hdfs-1972.txt
        77 kB
        Todd Lipcon
      2. hdfs-1972.txt
        74 kB
        Todd Lipcon
      3. hdfs-1972.txt
        46 kB
        Todd Lipcon
      4. hdfs-1972.txt
        47 kB
        Todd Lipcon
      5. hdfs-1972.txt
        40 kB
        Todd Lipcon
      6. hdfs-1972-v1.txt
        27 kB
        Todd Lipcon

        Issue Links

          Activity

          Transition: Open → Resolved
          Time In Source Status: 214d 12h
          Execution Times: 1
          Last Executer: Todd Lipcon
          Last Execution Date: 21/Dec/11 04:32
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #23 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/23/)
          HDFS-1972. Fencing mechanism for block invalidations and replications. Contributed by Todd Lipcon.

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221608
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/InvalidateBlocks.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/NumberReplicas.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingReplicationBlocks.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSDatasetAsyncDiskService.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ActiveState.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/HAState.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/StandbyState.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/DataNodeAdapter.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencingWithReplication.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyIsHot.java
          Todd Lipcon made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Fix Version/s HA branch (HDFS-1623) [ 12317568 ]
          Resolution Fixed [ 1 ]
          Todd Lipcon added a comment -

          Committed to the HA branch. Thanks for the reviews.

          Eli Collins added a comment -

          +1

          Looks great. Nice work! In particular the new tests are excellent.

          Todd Lipcon made changes -
          Attachment hdfs-1972.txt [ 12508141 ]
          Todd Lipcon added a comment -

          Attached patch cleans up the tests a bit. I ran the stress test case for 10 minutes or so and it passed with these updates.

          Todd Lipcon added a comment -

          Marking as blocked by HDFS-2693, since these tests will fail much of the time without the synchronization fixes in that JIRA.

          Todd Lipcon made changes -
          Link This issue is blocked by HDFS-2693 [ HDFS-2693 ]
          Todd Lipcon made changes -
          Attachment hdfs-1972.txt [ 12508043 ]
          Todd Lipcon added a comment -

          Here's a patch with more unit tests. I'd like to clean up the tests a bit to explain what's going on better, but I think the code is all finished and ready for review.

          Todd Lipcon made changes -
          Link This issue supercedes HDFS-2603 [ HDFS-2603 ]
          Todd Lipcon added a comment -

          I think I will actually integrate HDFS-2603 into this, since the issues are tightly linked. (I have a new test case which illustrates the issue). I'll upload a new patch soon, but this version is still worth reviewing.

          Todd Lipcon made changes -
          Attachment hdfs-1972.txt [ 12507981 ]
          Todd Lipcon added a comment -

          Rebased on trunk (no real changes - just one conflict which had gotten merged in already as part of another patch)

          Todd Lipcon made changes -
          Attachment hdfs-1972.txt [ 12507647 ]
          Todd Lipcon added a comment -

          New patch addresses your comments. I changed the terminology from "untrusted" to "stale" throughout, and inverted the polarity of the boolean in DatanodeDescriptor.

          Eli Collins added a comment -

          Patch looks like a good implementation of the approach. Here are my initial comments. For those following along and wondering why the patch doesn't have the DNs ignore commands from the standby: that part of the fencing was already done in HDFS-2627.

          I'd modify the comment by blockContentsTrusted to something like "DN may have some pending deletions issued by a prior NN that this NN is unaware of. Therefore we don't perform actions based on the contents of this DN until after we receive a BR followed by a heartbeat confirming the DN thought we were active, which means this NN is now up to date with respect to this DN". Maybe invert the polarity and rename it blockContentsStale, since we're really tracking whether the block contents are up-to-date?

          Update the javadoc for NumberReplicas; it would be good to define "untrusted": if a DN is considered untrusted then all of its replicas are considered untrusted.

          Not your change, but in BlockManager rename "count" to "decommissioned" and update the javadoc.

          In processMisReplicatedBlock, add a comment to the effect of (but better worded than) "countNodes counts all blocks from an untrusted DN as untrusted (and all DNs start out untrusted until their next heartbeat), however we only act on this mistrust if the block is over-replicated".

          The comment "If we have at least one" in invalidateBlock can be moved down to the 2nd if.

          I think it's OK to assume postponedMisreplicatedBlocks is always small. I suppose even if we re-commission a rack and immediately fail over this should be sufficient.
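
          A minimal sketch of the stale-flag bookkeeping discussed in this review (the field and method names here are illustrative, not necessarily those in the actual DatanodeDescriptor):

              // Illustrative sketch only; the real DatanodeDescriptor differs.
              class DatanodeStaleness {
                // True until this NN receives a block report generated after the DN
                // acknowledged this NN as active; while stale, the DN may still be
                // carrying out deletions issued by a previous active NN.
                private volatile boolean blockContentsStale = true;

                void markStaleOnFailover()       { blockContentsStale = true; }
                void onPostFailoverBlockReport() { blockContentsStale = false; }
                boolean areBlockContentsStale()  { return blockContentsStale; }
              }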

          Todd Lipcon added a comment -

          Yes, like Dhruba said, that's what the patch does. The slight added complexities are:

          (a) track only the postponed over-replicated blocks to prevent having to take a lock and rescan all the blocks once the last DN checks in.
          (b) we need to actually have a heartbeat and then a BR from each DN. If after the NN becomes active we get a BR immediately, there's a short window where it might receive a deletion request prior to the next heartbeat.

          @Dhruba: I considered your trick of reprocessing all the replicated blocks while holding only the readlock. But, it seems this is still high-impact – holding the readlock for potentially 10-20 seconds will block many operations including getBlockLocations (which updates access time) as well as any namespace writes.

          dhruba borthakur added a comment -

          > would it be possible to defer block replication decisions at the newly

          That is precisely what this JIRA is doing, and only deleting excess replicas is deferred.

          Hari Mankude added a comment -

          Suresh and I were discussing this and had a couple of suggestions.

          1. Add the block replication decisions to the editlog (this was shot down due to the volume of editlog entries it might bring about).
          2. A more interesting solution suggested by Suresh: would it be possible to defer block replication decisions at the newly active namenode until a new block report has arrived from the DNs? This would greatly simplify the implementation.

          Todd Lipcon added a comment -

          Yessir, thanks for summarizing better than I did

          Hari Mankude added a comment -

          Agreed that STONITH is not the only fencing solution. SCSI reservations, file locking, etc. can be used if they are supported by the shared storage.

          The last paragraph describes the problem correctly. I missed this during my read of the comments. Essentially, block replication decisions (creating new copies or deleting extra copies) are not put in the edit log. Hence, irrespective of shared storage fencing, we don't want the newly active namenode to make replication decisions without understanding the complete state of blocks from all the DNs.

          Todd Lipcon added a comment -

          STONITH is one possible fencing mechanism, but it requires special hardware support (e.g. a remotely controllable PDU or an ILOM-like capability on the machine). This addresses the namenode side of fencing: how do we make sure that a previously active NN can no longer write to the shared edits storage (i.e. ensure exclusive access for the new active)?

          With many storage types there are less drastic fencing methods available - e.g. filers often support an operation to fence off a particular IP from a given volume. Software systems like BookKeeper might support a "lease revoke" operation of sorts (just a guess). So we shouldn't design for STONITH as the only option if we can use other options that need less custom hardware.

          However, the above NN fencing methods don't deal with the races described here – the issue is that the standby necessarily has a stale view of pending deletions in the cluster. We need to essentially "flush" all deletions from the cluster before the new NN can make appropriate deletion decisions. This is because block replication decisions are not persisted to the shared storage. The issues mentioned here are important even in the case of manual transition from one NN to another.

          Hari Mankude added a comment -

          I had a question. Why are we not depending on STONITH to kill one of the namenodes? Assuming STONITH (along with ZK) is used to fail-fast the active namenode before the standby takes over, can the races mentioned above happen?

          Suresh Srinivas added a comment -

          Oops sorry, wrong jira.

          Suresh Srinivas added a comment -

          BTW, again we have many discussions related to the same topic happening on different JIRAs. It would be good if we stick to HDFS-2664.

          dhruba borthakur added a comment -

          findOverReplicatedReplicas runs with the readLock. chooseExcessReplicates runs without any lock but returns its results rather than modifying any global data structures. Then we acquire the fsnamesystem lock, quickly validate that the state of the block has not changed since we did all the computation (and most of the time the state does not change), and then proceed to do the action.

          Todd Lipcon added a comment -

          Dhruba, that's an interesting idea. I hadn't considered whether processMisReplicatedBlocks could be run in a background thread.

          Do you do this by chunking up the work so you only have to hold the lock for short periods of time? Or you think it can run entirely lockless?

          I'll take a look at your github link today.

          dhruba borthakur added a comment -

          > If an invalidateBlock call has to be postponed, add that block to a blocksToRecheck list of some kind.
          When all of the DNs have become trusted, scan all of the blocks in blocksToRecheck checking for misreplication.

          Suppose we do not keep a blocksToRecheck list. When any DN flips its trusted flag to true, check if all of the DNs are now trusted. If this is the case, go through the processMisReplicatedBlocks code path to scan all blocks in the system for potential over-replication, adding them to the invalidate queue. The execution of the processMisReplicatedBlocks() can occur in a background thread... better still, chooseExcessReplicates() does not really need the FSNamesystem lock. We have done this just to increase scalability: http://bit.ly/rUDVui

          Just a thought to avoid adding yet another data structure, 'blocksToRecheck', but if you feel that it is absolutely necessary, I am fine with it.
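
          A rough sketch of this alternative (a hypothetical class, not the actual BlockManager code): once the last DN flips to trusted, kick off the full mis-replication scan in a background thread instead of keeping a blocksToRecheck list.

              // Hypothetical sketch of the "no blocksToRecheck" approach described above.
              final class PostFailoverRescan {
                private final java.util.concurrent.atomic.AtomicInteger untrustedDatanodes;
                private final Runnable processMisReplicatedBlocks; // assumed to exist elsewhere

                PostFailoverRescan(int untrustedCount, Runnable rescan) {
                  this.untrustedDatanodes = new java.util.concurrent.atomic.AtomicInteger(untrustedCount);
                  this.processMisReplicatedBlocks = rescan;
                }

                /** Called whenever a DN flips its trusted flag to true. */
                void onDatanodeTrusted() {
                  if (untrustedDatanodes.decrementAndGet() == 0) {
                    // Run the full scan off the heartbeat path so the FSNamesystem lock
                    // is not held for the entire pass.
                    new Thread(processMisReplicatedBlocks, "post-failover-rescan").start();
                  }
                }
              }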

          Todd Lipcon made changes -
          Component/s ha [ 12316609 ]
          Todd Lipcon made changes -
          Attachment hdfs-1972.txt [ 12507114 ]
          Todd Lipcon added a comment -

          OK. I have an implementation of "Solution 2" from above which I believe works correctly in all the cases. A few minor differences from what's described above:

          • It turns out that the DN side of the block report didn't need to be modified. The DNA_INVALIDATE command already calls FSDataset.invalidate as soon as it enqueues it for asynchronous deletion. This will remove it from the DN side volumeMap structure, which is where we generate the block report from.

          One possible race here which is worth investigating: the DirectoryScanner could find the block file right before it's deleted, and re-add it to the volume map. I'd like to address this as a follow-up since it's a rare race and will take a while to write a decent test case for it.

          • Rather than modifying the BR call to include a flag to acknowledge active state, I changed the NN side to carry an additional flag which is set on the first heartbeat after failover. This has the same effect and I think was a little simpler than making a protocol change.

          I've also added a metric and metasave output for the list of postponed block deletions.

          This patch applies on top of HDFS-2602 from https://issues.apache.org/jira/secure/attachment/12506819/HDFS-2602.patch but will probably need to be re-generated after that's committed.
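
          A simplified sketch of the NN-side bookkeeping described above - a heartbeat after failover, followed by a block report, clears the stale state (names are illustrative, not the committed code):

              // Illustrative sketch: per-DN state kept by the newly active NN.
              final class PostFailoverDatanodeState {
                private boolean heartbeatedSinceBecomingActive = false;
                private boolean blockContentsStale = true;

                /** Called on every heartbeat this NN receives from the DN. */
                synchronized void onHeartbeat() { heartbeatedSinceBecomingActive = true; }

                /** Called when a block report arrives from the DN. */
                synchronized void onBlockReport() {
                  // Only a report that follows a heartbeat to the new active NN proves
                  // the NN's view of this DN's blocks is up to date.
                  if (heartbeatedSinceBecomingActive) {
                    blockContentsStale = false;
                  }
                }

                synchronized boolean isStale() { return blockContentsStale; }
              }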

          Todd Lipcon added a comment -

          Rough sketch of the design here (names of variables subject to change of course):

          • In each DataNodeDescriptor object in the NN, we add a flag: blockInfoTrusted. When the NN transitions to active mode, it sets this flag to false on every DN.
            This flag indicates in the NN that the DN may have some pending deletions issued by a prior NN that we're unaware of. That is to say, we can't necessarily trust that we have 100% up-to-date block information.
          • Make two modifications to DN's block report behavior:
            • Add a flag indicating whether the DN considered that NN active at the time when it began generating the block report.
              This flag acts as the "promise" described above. The DN is promising that, since it has acknowledged this NN as active, it won't accept commands from an NN with an earlier txid.

            • Include any blocks that are pending deletion as if they were deleted.
              This ensures that, even if block deletions are slow to be enacted due to slow local disks, the new NN will know that these blocks are on their way to deletion.

          • When the NN receives a block report with the new flag set, then flip its trusted flag. The block report will include any deletions or pending deletions, plus carries the promise above, so from this point forward, the NN has full "control" over the DN. That is to say, fencing is complete for this DN.
          • The NN postpones invalidation of any blocks with replicas on "untrusted" DNs. In invalidateBlocks, we already call countNodes to iterate over the other locations of the block. If any of those nodes is still considered untrusted, then we cannot make a safe invalidation decision about this block.
            • Optimization: if the invalidation is because the file itself has been deleted, it's safe to invalidate all the replicas regardless of trusted status.
            • Potential Optimization: if the location to be invalidated is untrusted, we can invalidate it - worst case we just issue a duplicate invalidation.

          How to postpone invalidation:

          Solution 1:
          • Ignore the invalidateBlock call when it has to be postponed.
          • When any DN flips its trusted flag to true, check if all of the DNs are now trusted. If this is the case, go through the processMisReplicatedBlocks code path to scan all blocks in the system for potential over-replication, adding them to the invalidate queue.

          Pro: fairly simple to implement
          Con: have to scan the entire blockmap once the last DN has become trusted - this can take 10+ seconds while holding the FSN lock, which is a little bit nasty.

          Solution 2 (optimization of Solution 1):
          • If an invalidateBlock call has to be postponed, add that block to a blocksToRecheck list of some kind.
          • When all of the DNs have become trusted, scan all of the blocks in blocksToRecheck checking for misreplication. The number of blocks in this list is expected to be very short - it only contains those blocks which were over-replicated right around the time of failover. So this should be a trivial amount of time to hold the lock.

          Pro: no long period holding FSN lock
          Con: a little more complicated


          I'm still working out some of the kinks above, but plan to start implementing and see where it takes me. Feedback appreciated if there is a simpler option.

          One interesting thing: I believe this issue can actually present itself even in a non-HA setup.

          • Filesystem running with many files with high replication count
          • User issues fs -setrep 1 to reduce replication of most of the files in the system
          • NN crashes and restarts on same machine
          • DNs re-register with the same NN while they still have items in their deletion queues. The new NN chooses different replicas to remove. Data loss ensues.

          The DN-side fix above which includes pending-deletion blocks as deleted in the BR should fix this non-HA issue as well.
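
          A condensed sketch of Solution 2 (hypothetical names and stubs; not the real BlockManager logic):

              // Hypothetical sketch: postpone invalidation while any other replica sits on an
              // untrusted DN, and recheck only the postponed blocks once all DNs are trusted.
              final class InvalidationPostponer {
                static final class Replica { boolean onTrustedDatanode; }

                private final java.util.Set<Long> blocksToRecheck = new java.util.HashSet<>();

                /** Returns true if invalidation was issued, false if it was postponed. */
                boolean tryInvalidate(long blockId, java.util.List<Replica> otherLocations, boolean fileDeleted) {
                  if (fileDeleted) {
                    issueInvalidate(blockId);      // optimization: safe regardless of trusted status
                    return true;
                  }
                  for (Replica r : otherLocations) {
                    if (!r.onTrustedDatanode) {    // that DN may hold deletions we don't know about
                      blocksToRecheck.add(blockId);
                      return false;
                    }
                  }
                  issueInvalidate(blockId);
                  return true;
                }

                /** Called once every DN has become trusted: recheck only the postponed blocks. */
                void rescanPostponed() {
                  for (long b : blocksToRecheck) { recheckReplication(b); }
                  blocksToRecheck.clear();
                }

                // Stubs standing in for real NN machinery.
                private void issueInvalidate(long blockId) { }
                private void recheckReplication(long blockId) { }
              }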

          Eli Collins added a comment -

          The proposed solution looks solid to me. Note that technically the promise isn't that the DN won't accept any further commands from a previous NN, but that it won't accept any commands from an NN with a lower sequence number (the same promise that an acceptor in Paxos makes, by the way). If the previous NN showed up with a higher sequence number due to failback, the DN should accept commands from it.

          Todd Lipcon added a comment -

          I broke out the basic implementation of determining who is active based on heartbeat responses to HDFS-2627.

          I've thought a bit about the proposed solution above and haven't come up with any holes yet... my next steps for early next week will be these (hopefully separate patches for each for easy review):

          • finish HDFS-2627 to a committable state (first draft patch is up but needs a functional test and some code/style cleanup)
          • implement the DN side "promise" functionality - when it detects a new active, it needs to ACK the active transition, and the NN needs to keep track of whether each DN has ACKed.
          • implement code in the NN which prevents issuance of block invalidation until DNs have acked and sent block reports
          • implement a stress test: create many blocks, and toggle them back and forth between replication level 1 and replication level 2. fail back and forth between two NNs continuously. Ensure that after several hours of runtime we haven't lost any blocks.
          Todd Lipcon made changes -
          Link This issue incorporates HDFS-2627 [ HDFS-2627 ]
          Todd Lipcon made changes -
          Attachment hdfs-1972-v1.txt [ 12505840 ]
          Todd Lipcon added a comment -

          Here's some initial work on this JIRA - the NNs return their HA state in the heartbeat, and the DNs use that to determine which is active. I may break this out to a separate JIRA for commit so that the patches are manageable to review.

          Still need to implement the rest of the above proposal.

          dhruba borthakur added a comment -

          I recollect that we prevent the primary namenode from issuing BLOCK_INVALIDATE commands for a particular datanode until it has received a full block report from that datanode (since the last failover event). But I will read proposal 3 in greater detail.

          Andrew Purtell added a comment -

          I will go back to lurking on this issue right away but kindly allow me to +1 this notion:

          Persisting the txid in the DN disks actually has another nice property for non-HA clusters – if you accidentally restart the NN from an old snapshot of the filesystem state, the DNs can refuse to connect, or refuse to process deletions. Currently, in this situation, the DNs would connect and then delete all of the newer blocks.

          Encountering this scenario through a series of accidents has been a concern. Disallowing block deletion as proposed would be enough to give the operators a chance to recover from their mistake before permanent damage.

          Todd Lipcon added a comment -

          Here are some thoughts on this JIRA, hopefully organized in a way that's conducive to discussion:

          Point 0:

          We should always err on the side of conservatism – we would rather introduce inefficiency than incorrectness.

          Point 1:

          Of the commands that come from NN to DN, the only one that's really dangerous is DNA_INVALIDATE:

          • DNA_REGISTER is scoped to a single NN
          • DNA_TRANSFER, if done mistakenly, can only result in an extra replica, which would be cleaned up later
          • DNA_RECOVERBLOCK - should be idempotent (though we should investigate this separately)
          • DNA_ACCESSKEYUPDATE - this is logged to the edit log, so the new NN should also have the same keys (also need to check on this)

          DNA_INVALIDATE however could cause data loss if the NNs are both trying to deal with an over-replicated block, but make different decisions on which replicas to invalidate.
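
          The classification above can be summarized in a tiny, purely illustrative helper (the enum names mirror the DNA_* actions listed, but this type is not part of DatanodeProtocol):

              // Purely illustrative; not part of the actual DatanodeProtocol API.
              enum NnToDnCommand { REGISTER, TRANSFER, RECOVERBLOCK, ACCESSKEYUPDATE, INVALIDATE }

              final class CommandSafety {
                /** Per the analysis above, only INVALIDATE can destroy the last remaining replica. */
                static boolean dangerousFromNonActiveNamenode(NnToDnCommand c) {
                  return c == NnToDnCommand.INVALIDATE;
                }
              }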

          Point 2:

          Because block invalidation is reported asynchronously, the new active during a failover will not be aware of pending deletions. For example:

          • block has 2 replicas, user asks to reduce to 1
          • NN1 asks DN1 to delete it
          • DN1 receives the command, deletes the replica, but does not do a deletion report yet
          • Failover NN1->NN2 occurs. Both DN1 and DN2 properly acknowledge the new active.
          • NN2 doesn't know that DN1 has deleted the block, since it hasn't received a deletion report yet
          • NN2 asks DN2 to delete it. DN2 complies because NN2 is active.
          • Now we have no replicas

          So, even with a completely consistent view of which is active, we have an issue. So, the solution to this issue isn't just to make a consistent switchover on each DN - we need some stronger guarantee that prevents deletions during the period when the new NN's cluster view is out of date.

          Proposed solution to above:

          After a failover, the active NN should not issue any deletions until all pending deletions from prior NNs have been processed. So the fencing happens at two points:

          In DN: Upon deciding that some NN is newly active, it issues a promise to that NN that it will not accept any further commands from a previous NN. [1]
          In NN: Upon becoming active, it must not issue any deletions until:
          a) Every DN has issued the promise described above
          b) Every DN has also issued a block deletion report[2] which includes all pending or processed deletions.

          [1] this promise may have to be recorded in its on-disk state - see Scenario 3 below.
          [2] If we want to be really paranoid, we could wait for a full block report.
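
          A schematic sketch of the NN-side gate described in this proposal (a hypothetical class; in the real NN this bookkeeping might live in the per-datanode descriptors):

              // Hypothetical sketch: the NN issues deletions only after every live DN has
              // (a) promised to reject older NNs and (b) sent a post-failover deletion report.
              final class DeletionGate {
                private final java.util.Map<String, Boolean> promised = new java.util.HashMap<>();
                private final java.util.Map<String, Boolean> reported = new java.util.HashMap<>();

                void onDatanodeRegistered(String dnId)   { promised.put(dnId, false); reported.put(dnId, false); }
                void onPromise(String dnId)              { promised.put(dnId, true); }
                void onDeletionReport(String dnId)       { reported.put(dnId, true); }
                void onDatanodeDeclaredDead(String dnId) { promised.remove(dnId); reported.remove(dnId); }

                /** True once every live DN has both promised and reported. */
                boolean mayIssueDeletions() {
                  return !promised.containsValue(false) && !reported.containsValue(false);
                }
              }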

          A few example scenarios:

          Scenario 1: standard failover

          • block has 2 replicas, user asks to reduce to 1
          • NN1 asks DN1 to delete it
          • DN1 receives the command, deletes the replica, but does not do a deletion report yet
          • Failover NN1->NN2 occurs. Both DN1 and DN2 properly acknowledge NN2 is active and promise not to accept commands from NN1.
          • NN1 might try to continue to issue commands but they will be ignored.
          • NN2 doesn't know that DN1 has deleted the block, since it hasn't sent a deletion report yet.
          • NN2 considers the block over-replicated, but does not send an INVALIDATE command, because it hasn't gotten a deletion report from all DNs.
          • DN1 sends its deletion report. DN2 sends a deletion report which is empty.
          • Since NN2 has now received a deletion report from all DNs, it knows there's only one replica, and doesn't ask for the deletion.

          Scenario 2: cluster partition (NN1 and DN1 on one side of the cluster, NN2 and DN2 on the other side of the cluster)

          • block has 2 replicas, user asks to reduce to 1
          • NN1 asks DN1 to delete it
          • DN1 receives the command, deletes the replica, but does not do a deletion report yet
          • Network partition happens. NN1 thinks it's still active for some period of time, but NN2 is the one that actually has access to the edit logs
          • DN1 deletes the block, and reports it to NN1. It tries to report to NN2 but it can't connect anymore due to partition.
          • DN2 promises to NN2 that it won't accept commands from NN1. It doesn't talk to NN1 because it's partitioned.
          • NN2 still considers DN1 and DN2 as alive because they haven't timed out yet. Thus it considers the block to be over-replicated.
          • NN2 doesn't send deletion commands because it hasn't gotten a deletion report from all DNs after the failover event.
          • Eventually NN2 considers DN1 dead. At this point, that replica is removed from the map, so the block is no longer considered over-replicated.
          • Thus, NN2 doesn't delete the good replica on DN2.

          Scenario 3: DN restarts during split brain period

          (this scenario illustrates why I think we need to persistently record the promise about who is active)

          • block has 2 replicas, user asks to reduce to 1
          • NN1 adds the block to DN1's invalidation queue, but it's backed up behind a bunch of other commands, so doesn't get issued yet.
          • Failover occurs, but NN1 still thinks it's active.
          • DN1 promises to NN2 not to accept commands from NN1. It sends an empty deletion report to NN2. Then, it crashes.
          • NN2 has received a deletion report from everyone, and asks DN2 to delete the block. It hasn't realized that DN1 is crashed yet.
          • DN2 deletes the block.
          • DN1 starts back up. When it comes back up, it talks to NN1 first (maybe it takes a while to connect to NN2 for some reason)
            • Now, if we had saved the "promise" as part of persistent state, we could ignore NN1 and avoid this issue. Otherwise:
            • NN1 still thinks it's active, and sends a command to DN1 to delete the block. DN1 does so.
            • We lost the block

          Persisting the txid in the DN disks actually has another nice property for non-HA clusters – if you accidentally restart the NN from an old snapshot of the filesystem state, the DNs can refuse to connect, or refuse to process deletions. Currently, in this situation, the DNs would connect and then delete all of the newer blocks.

          Todd Lipcon added a comment -

          Let me think through that case - it's an interesting one I hadn't considered yet.

          Suresh Srinivas added a comment -

          BTW, looking through your proposal, you cannot handle the case where a datanode joins the cluster during split brain and only sees one of the active nodes (one that has unexpectedly remained in the active state). Hence we either depend on ZK to provide the answer, or assume that fencing has taken care of shutting down the errant active.

          Todd Lipcon added a comment -

          Yep, that sounds like a good idea. I was going back and forth between your solution and the one I outlined above when I wrote the above comment. But since you had the same idea too, it's probably better

          Suresh Srinivas added a comment -

          My thoughts were a bit different from yours:

          1. I wanted to change the heartbeat response from an array of commands to a HeartbeatResponse with a set of commands in it.
          2. HeartbeatResponse would also carry other information - to start with, it will have the state of the namenode. This will be used by the Datanode to handle active/standby transitions.

          HeartbeatResponse also comes with the added benefit of being able to pass other information in the future (for example, the current transactionID of the namenode, etc.).

          What do you think?
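
          A sketch of what such a HeartbeatResponse might look like (illustrative only; not the class that was eventually committed):

              import org.apache.hadoop.hdfs.server.protocol.DatanodeCommand;

              // Illustrative only: a heartbeat response carrying commands plus NN state.
              public class HeartbeatResponse {
                public enum NamenodeState { ACTIVE, STANDBY }

                private final DatanodeCommand[] commands;  // what heartbeats used to return directly
                private final NamenodeState state;         // lets the DN track active/standby transitions
                private final long txid;                   // room to grow: NN's current transaction ID

                public HeartbeatResponse(DatanodeCommand[] commands, NamenodeState state, long txid) {
                  this.commands = commands;
                  this.state = state;
                  this.txid = txid;
                }

                public DatanodeCommand[] getCommands() { return commands; }
                public NamenodeState getState()        { return state; }
                public long getTxid()                  { return txid; }
              }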

          Todd Lipcon made changes -
          Field Original Value New Value
          Assignee Suresh Srinivas [ sureshms ] Todd Lipcon [ tlipcon ]
          Todd Lipcon added a comment -

          Going to assign this to myself since I'm actively working on it. Suresh, if you had started with any design or code, feel free to post what you've got and I'll take it from there.

          Todd Lipcon added a comment -

          I've been thinking about this design a bit... it's still a little fuzzy in my head, but I'm thinking something like this:

          • In the heartbeat response from the NN, we add a new DatanodeCommand type called CLAIM_ACTIVE_STATE or something. This command includes the current transaction ID from that NN's perspective.
          • In BPServiceActor, when this command is received, it passes it to BPOfferService. BPOfferService maintains the txid of the latest ACTIVE transition. If the new ACTIVE claim's txid is higher than the previous ACTIVE claim's txid, then the DN will agree to consider the new NN the active from that point forward.

          Because the txids are guaranteed unique between the active NNs, this should provide proper fencing. If both of the NNs claim the same txid, then that means we had a failure of fencing in the edit logs, and all bets are off.

          I'm also thinking about whether we can add some extra safeguards to detect inconsistent scenarios and fail fast. (eg if two NNs claim to be active for an overlapping range of ids, we should barf quickly rather than continue down a split brain path)
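
          A minimal sketch of the "higher txid wins" rule described above (illustrative; the method names are not the actual BPOfferService API):

              // Illustrative sketch of how a DN could decide which NN to treat as active.
              final class ActiveNamenodeTracker<NN> {
                private NN activeNn = null;
                private long lastActiveClaimTxid = Long.MIN_VALUE;

                /** Called when an NN claims the active state, carrying its current txid. */
                synchronized boolean handleClaimActive(NN claimant, long claimTxid) {
                  if (claimTxid > lastActiveClaimTxid) {
                    activeNn = claimant;
                    lastActiveClaimTxid = claimTxid;
                    return true;   // newer claim than anything seen before: accept it
                  }
                  // A lower txid means a stale NN; an equal txid would mean edit-log
                  // fencing failed (split brain), so we refuse to switch.
                  return false;
                }

                synchronized boolean isActive(NN nn) { return nn != null && nn == activeNn; }
              }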

          Suresh Srinivas created issue -

            People

            • Assignee: Todd Lipcon
            • Reporter: Suresh Srinivas
            • Votes: 0
            • Watchers: 14
