Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: HA branch (HDFS-1623)
    • Fix Version/s: None
    • Component/s: datanode, ha, namenode
    • Labels: None

      Description

      The replication stress test case failed over the weekend since one of the replicas went missing. Still diagnosing the issue, but it seems like the chain of events was something like:

      • a block report was generated on one of the nodes while the block was being written - thus the block report listed the block as RBW
      • when the standby replayed this queued message, it was replayed after the file was marked complete. Thus it marked this replica as corrupt
      • it asked the DN holding the corrupt replica to delete it. And, I think, removed it from the block map at this time.
      • That DN then did another block report before receiving the deletion. This caused it to be re-added to the block map, since it was "FINALIZED" now.
      • Replication was lowered on the file, and it counted the above replica as non-corrupt, and asked for the other replicas to be deleted.
      • All replicas were lost.
      Attachments

      1. hdfs-2742.txt (78 kB) - Todd Lipcon
      2. hdfs-2742.txt (79 kB) - Todd Lipcon
      3. hdfs-2742.txt (75 kB) - Todd Lipcon
      4. hdfs-2742.txt (83 kB) - Todd Lipcon
      5. hdfs-2742.txt (79 kB) - Todd Lipcon
      6. hdfs-2742.txt (74 kB) - Todd Lipcon
      7. hdfs-2742.txt (56 kB) - Todd Lipcon
      8. hdfs-2742.txt (5 kB) - Todd Lipcon
      9. log-colorized.txt (7.47 MB) - Todd Lipcon

        Activity

        Todd Lipcon added a comment -

        Here's the log with some color added to make it easier to read. View with "less -R". The block in question is blk_6553001497155157032_1081

        Todd Lipcon added a comment -

        This is really two separate bugs.
        Bug 1 (HA specific): if a block report reports an RBW replica, the report is delayed on the SBN, and the file is closed before the SBN processes the delayed block report, the SBN will incorrectly mark the block as corrupt. I'll post a patch to fix this bug - just running some more tests on it.

        Bug 2 (non-HA): If a block is marked corrupt, and the DN does a block report, it will unmark the corrupt state of that block. I believe this is a regression of HDFS-900, but there wasn't any unit test with that bug fix so it's hard to know if it's the same or subtly different. I believe this affects the non-HA case as well, so I'll file a JIRA for it against trunk.
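        A minimal, hypothetical sketch of the check behind Bug 1 (the enum and class names are made up, not the actual BlockManager code): a replica reported as RBW for a block the namesystem already considers COMPLETE looks inconsistent and is treated as corrupt.

          // Illustrative only: enum and class names are invented, not HDFS's.
          enum BlockUCState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }
          enum ReplicaState { RBW, FINALIZED }

          class StaleRbwReportSketch {
            // Bug 1: a report queued while the block was being written is
            // replayed after the SBN marked the block COMPLETE, so the RBW
            // replica is (wrongly) judged corrupt and scheduled for deletion.
            static boolean looksCorrupt(BlockUCState blockState, ReplicaState reported) {
              return blockState == BlockUCState.COMPLETE && reported == ReplicaState.RBW;
            }

            public static void main(String[] args) {
              System.out.println(looksCorrupt(BlockUCState.COMPLETE, ReplicaState.RBW)); // true
            }
          }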

        Todd Lipcon added a comment -

        Attached patch adds a test to trigger the issue, and a fix.

        Todd Lipcon added a comment -

        This seems to have caused some issues in TestHASafeMode. I'll upload a new rev later today.

        Todd Lipcon added a comment -

        Looking into this has made me aware of a lurking can of worms... the summary of the issue is that we have to make sure the interleaving of block state transitions and messages from the datanode is maintained. The sequence needs to look like:

        1. Block is allocated
        2. Block reports may arrive with the block in RBW state
        3. Block is "completed"
        4. Block reports and blockReceived messages may arrive with the block in FINALIZED state.

        On the active node, this sequence is guaranteed since we only mark a block as "complete" once the minimum number of replicas has reported FINALIZED. And once any replica reports FINALIZED, we shouldn't see any more RBW replicas, unless they're truly corrupt.

        On the standby node, though, the application of the edits is delayed until it reads the shared storage log. So it may receive step 2 and step 4 long before it even knows about the block. The trick is that we need to interleave them into the correct position in the edits stream.

        The issue in this JIRA is that, in the tip of the branch today, we are processing all queued messages after applying all the edits. So, if we received a block report with an RBW replica, it will be processed after the replica is already completed, thus swapping step 2 and 3 in the above sequence. This results in the block being marked as corrupt.
        If instead we try to process the queued messages as soon as we first hear about the block, we have the opposite problem – step 3 and step 4 are switched. This causes problems for Safe Mode since it isn't properly accounting for the number of complete blocks in that case. Hence the patch currently attached to this JIRA breaks TestHASafeMode.

        I spent the afternoon thinking about it, and the best solution I can come up with is the following:

        • rather than a single PendingDatanodeMessages queue, where we queue up the entire block report or blockReceived message, we should make the queueing more fine-grained. So, if we receive a block report, we can "open it up" and handle each block separately. For each block, we have a few cases:
          • correct state: the replica has the right genstamp and a consistent state - eg an RBW replica for an in-progress block or a FINALIZED replica for a completed block. We can handle these immediately.
          • too-high genstamp: the replica being reported has a higher generation stamp than what we think is current for the block. Queue it.
          • correct genstamp, wrong state: eg a FINALIZED replica for an incomplete block. Queue it.
        • When replaying edits, check the queue whenever (a) a new block is created, (b) a block's genstamp is updated, (c) a block's completion state is changed
          • if the block has just become complete, process any FINALIZED reports
          • if the block has just been allocated or gen-stamp-bumped, process any RBW reports
        • During a failover, after we have completely caught up our namespace state, process all pending messages regardless of whether they are "consistent".

        This is kind of complicated, but I can't think of much better. The one nice advantage it brings is that we don't have to delay a large BR full of old blocks just because it happens to include just one new block. This should keep the standby "hotter" and avoid using a bunch of memory for queued messages.
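        As a rough sketch of the per-replica classification described in the list above (the three cases), something like the following could decide apply-now versus queue; the names and simplified state here are invented and this is not the code in the attached patch.

          // Illustrative classification of one reported replica against the
          // standby's current view of the block; names and fields are invented.
          class ReportedReplicaClassifierSketch {
            enum Action { APPLY_NOW, QUEUE }

            static Action classify(long blockGenStamp, boolean blockComplete,
                                   long reportedGenStamp, boolean reportedFinalized) {
              if (reportedGenStamp > blockGenStamp) {
                return Action.QUEUE; // too-high genstamp: matching edit not replayed yet
              }
              boolean consistent =
                  (blockComplete && reportedFinalized)       // FINALIZED for a completed block
                  || (!blockComplete && !reportedFinalized); // RBW for an in-progress block
              return consistent ? Action.APPLY_NOW : Action.QUEUE; // wrong state: queue it
            }
          }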

        Todd Lipcon added a comment -

        I'm looking into a simpler solution where we redo a bit of how the "safe block count" is tracked during safemode. This would allow "3" and "4" to be safely inverted above without breaking safemode, if I can get it to work right.

        Suresh Srinivas added a comment -

        Todd, a couple of questions:

        1. What is the implication of ignoring RBW altogether at the standby?
        2. If editlog has a finalized record, can we just ignore the RBW from the block report?
        Todd Lipcon added a comment -

        What is the implication of ignoring RBW altogether at the standby?

        That's an idea I've thought a little about, but I think it has some implications for lease recovery. In actuality, in order to fix the cases in HDFS-2691, I think we need to send RBW blockReceived messages to the SBN as soon as a pipeline is constructed.

        I do like it, though, as at least a stop-gap for now while we work on a more thorough solution.

        If editlog has a finalized record, can we just ignore the RBW from the block report?

        Possibly - I haven't thought through the whole Append state machine. I assumed that the code that marks an RBW replica as corrupt when it is received for a COMPLETED block is probably there for a good reason... so changing the behavior there might introduce some other bugs that could even hurt the non-HA case.

        I'm going to keep working on this and see if I can come up with a simpler solution based on some of Suresh's ideas above.

        Suresh Srinivas added a comment -

        Here is an idea to handle this problem:
        On the standby, for RBW replicas we do nothing and just put them in a queue when they are received in a block report. At this point two things can happen:

        1. A blockReceived comes from the datanode indicating finalization of the block. We use this to just remove the entry from the RBW queue.
        2. A blockReceived never comes; then lease recovery has to handle it. We can use the RBW entry to populate the block, if it does not exist, before lease recovery.
        Hari Mankude added a comment -

        This is a good idea. It keeps the log stream separate from the datanode-based block messages. We would need to handle "lost" messages. For example, the primary might have received a block-finalized message that was not delivered to the standby. So the RBW message in the standby's queue would have to be purged after a timeout.
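        A hedged sketch of this alternative (a side queue for RBW replicas on the standby, removal on the finalizing blockReceived, a check before lease recovery, and the timeout purge suggested above); the class name, methods, and timeout value are invented for illustration.

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          // Illustrative only: park RBW replicas reported to the standby, drop
          // them when the finalizing blockReceived arrives, consult them before
          // lease recovery, and expire stale entries to cover "lost" messages.
          class StandbyRbwQueueSketch {
            private static final long PURGE_AFTER_MS = 10 * 60 * 1000L; // assumed timeout

            private final Map<Long, Long> rbwSeenAt = new ConcurrentHashMap<>();

            void onRbwReported(long blockId) {
              rbwSeenAt.putIfAbsent(blockId, System.currentTimeMillis());
            }

            void onBlockReceivedFinalized(long blockId) {
              rbwSeenAt.remove(blockId); // case 1: finalization arrived, entry no longer needed
            }

            boolean hasQueuedRbw(long blockId) {
              return rbwSeenAt.containsKey(blockId); // case 2: checked before lease recovery
            }

            void purgeExpired() {
              long now = System.currentTimeMillis();
              rbwSeenAt.values().removeIf(t -> now - t > PURGE_AFTER_MS);
            }
          }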

        Todd Lipcon added a comment -

        Attached patch fixes the issue described here, though I hope to write a few more test cases around the new code, and also add a little code when the NN transitions to Active that flushes the new queue.

        The approach is a combination of my original proposal with some of the ideas from Suresh above:

        • Change the PendingDataNodeMessages queue to be a block-specific queue rather than queueing entire messages. This actually helps the SBN stay up to date better and reduces memory usage, so it's good for other reasons as well. I moved it into BlockManager and added some simple true unit tests for it.
        • When we get a block which is "bad" due to state reasons (eg RBW on a COMPLETE block, or wrong gen stamp, or future gen stamp), we queue it.
        • When the edit log loader hits an OP_ADD or OP_CLOSE which modifies a block, we flush the queue for that block. Any messages that are now "correct" are applied, while "incorrect" messages end up getting re-queued by the same logic as above.
        • I had to modify a couple of the TestHASafeMode tests since the expected behavior is slightly different.

        There are still some edge cases around what happens if an RBW block report is delayed so long that it arrives after the OP_ADD and OP_CLOSE have been replayed. This is true in trunk as well, though - I opened HDFS-2791, which describes the issue. I think this will be very rare in practice, so I would like to address it separately from this JIRA.
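        A simplified sketch of the flush-on-replay flow just described; the interface and method names are invented, and the real PendingDataNodeMessages/FSEditLogLoader code in the patch is more involved.

          import java.util.ArrayDeque;
          import java.util.HashMap;
          import java.util.Map;
          import java.util.Queue;

          // Illustrative only: queue datanode messages per block, and when an
          // OP_ADD / OP_CLOSE touches that block during edit replay, drain the
          // queue, applying now-consistent messages and re-queueing the rest.
          class PerBlockMessageQueueSketch {
            interface QueuedMessage {
              boolean consistentWithCurrentState(); // e.g. FINALIZED report vs COMPLETE block
              void apply();                         // normal block report handling
            }

            private final Map<Long, Queue<QueuedMessage>> pendingByBlock = new HashMap<>();

            void enqueue(long blockId, QueuedMessage m) {
              pendingByBlock.computeIfAbsent(blockId, k -> new ArrayDeque<>()).add(m);
            }

            // Called by the edit log loader after applying an op that modifies blockId.
            void flushBlock(long blockId) {
              Queue<QueuedMessage> queued = pendingByBlock.remove(blockId);
              if (queued == null) {
                return;
              }
              for (QueuedMessage m : queued) {
                if (m.consistentWithCurrentState()) {
                  m.apply();            // e.g. an RBW report once the block is open again
                } else {
                  enqueue(blockId, m);  // still ahead of namespace state; wait for a later edit
                }
              }
            }
          }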

        Todd Lipcon added a comment -

        Updated patch addresses issues around append while the SBN is in safe mode, and cleans up some more of the code/logging/etc. I ran the full suite of unit tests last night and they passed.

        Todd Lipcon added a comment -

        I also ran the replication stress test for 10x as long as normal and had no block loss. I'll loop this for a while to make sure it's really stable.

        Todd Lipcon added a comment -

        This may have broken the optimization where we don't initialize repl queues until the end of safemode - we probably need to turn off the incremental tracking of safe block count in the case that it's an empty FS being loaded... Still looking into it.

        Todd Lipcon added a comment -

        OK, here's a patch which addresses several issues since the previous revision. I've been testing on a real cluster running HBase with a combination of graceful failovers, full restarts, etc., and think I've ironed out the bugs. I also added a number of new asserts to expose any places where we might have further bugs (and I'm running my cluster with assertions enabled).

        Todd Lipcon added a comment -

        I should also note that this patch modifies TestHASafeMode.testEnterSafeModeInANNShouldNotThrowNPE - previously it had an incorrect assertion that the NN would be in safemode after startup, even though the namesystem was empty. This is not supposed to be the case and was actually a prior bug which got fixed along the way of this patch. I added a new non-HA test case for this behavior in HDFS-2817.

        Todd Lipcon added a comment -

        Updated on tip of branch. This applies on top of HDFS-2791 since one of the test cases fails without it.

        Eli Collins added a comment -

        Todd,

        Approach in the latest patch looks good to me. The tests are great. Mostly minor comments.

        I think BM should distinguish between corrupt and out-of-date replicas. The new case in processFirstBlockReport in this patch, and the place where we mark reported RBW replicas for completed blocks as corrupt, are using "corrupt" as a proxy for "please delete". I wasn't able to come up with additional bugs with a similar cause, but it would be easier to reason about if only truly corrupt replicas were marked as such. Can punt to a separate jira, if you agree.

        In FSNamesystem#isSafeModeTrackingBlocks, shouldn't we assert that haEnabled is true if we're in SM and shouldIncrementallyTrackBlocks is true, instead of short-circuiting? We currently wouldn't know if we violate this condition because we'll just return false when haEnabled is false.

        Nits:

        • s/stam or/stamp or
        • s/tracing/tracking
        • increment|decrementSafeBlockCount need indenting fixes
        • Can remove NameNodeAdapter TestSafeMode diffs
        • Can remove TODO comment in BM#getActiveBlockCount
        • Append "null otherwise" to "if it should be kept" in BM

        Thanks,
        Eli

        Sanjay Radia added a comment -

        Some of the problems get fixed by HDFS-2791 (since RBWs are now ignored if the gen stamp matches). The description at the top includes the problem scenarios of HDFS-2791. Todd, can you summarize the remaining problem scenarios after HDFS-2791?

        Todd Lipcon added a comment -

        Fixed all the nits above except for the indentation - I didn't see any place with improper indentation.

        I think BM should distinguish between corrupt and out-of-dates replicas. The new case in processFirstBlockReport in thispatch, and where we mark reported RBW replicas for completed blocks as corrupt are using "corrupt" as a proxy for "please delete". I wasn't able to come up with additional bugs that with a similar cause but it would be easier to reason about if only truly corrupt replicas were marked as such. Can punt to a separate jira, if you agree.

        I don't entirely follow what you're getting at here... so let's open a new JIRA

        In FSNamesystem#isSafeModeTrackingBlocks, shouldn't we assert haEnabled is enabled if we're in SM and shouldIncrementallyTrackBlocks is true, instead of short-circuiting? We currently wouldn't know if we violate this condition because we'll return false if haEnabled.

        I did the check for haEnabled in FSNamesystem rather than SafeModeInfo, since when HA is not enabled we can avoid the volatile read of safeModeInfo. This is to avoid having any impact on the non-HA case. Is that what you're referring to? Not sure specifically what you're asking for in this change...

        I changed setBlockTotal to only set shouldIncrementallyTrackBlocks to true when HA is enabled, and added assert haEnabled in adjustBlockTotals. Does that address your comment?
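        A hedged sketch of the shape being discussed (field and method names only approximate FSNamesystem/SafeModeInfo; this is not the patch's code): the cheap haEnabled check runs before the volatile read, and the HA-only invariant is asserted where the totals get adjusted.

          // Illustrative only; not the actual FSNamesystem implementation.
          class SafeModeTrackingSketch {
            private final boolean haEnabled;
            private volatile Object safeMode; // stand-in for the SafeModeInfo reference

            SafeModeTrackingSketch(boolean haEnabled) {
              this.haEnabled = haEnabled;
            }

            boolean isSafeModeTrackingBlocks() {
              if (!haEnabled) {
                // Non-HA: return before the volatile read so the common path
                // pays no extra cost for the HA-only feature.
                return false;
              }
              Object sm = safeMode; // single volatile read
              return sm != null;    // real code would also check shouldIncrementallyTrackBlocks()
            }

            void adjustBlockTotals(int deltaSafe, int deltaTotal) {
              // Incremental tracking is only enabled under HA, so assert it here.
              assert haEnabled;
              // ... adjust the safe and total block counts ...
            }
          }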

        Eli Collins added a comment -

        I don't entirely follow what you're getting at here... so let's open a new JIRA

        See my last comment in HDFS-2791. If that makes sense we can follow up in a separate jira since it's not a new issue introduced in this change.

        I did the check for haEnabled in FSNamesystem rather than SafeModeInfo, since when HA is enabled it means we can avoid the volatile read of safeModeInfo. This is to avoid having any impact on the HA case. Is that what you're referring to?

        Yes, I was saying you can remove the check against haEnabled, didn't realize you were doing it as a performance optimization.

        I changed setBlockTotal to only set shouldIncrementallyTrackBlocks to true when HA is enabled, and added assert haEnabled in adjustBlockTotals. Does that address your comment?

        Yup, looks good, that's another way of asserting haEnabled if we're incrementally tracking blocks.

        Nit: NameNodeAdapter has a duplicate import of SafeModeInfo. Otherwise looks great, +1

        Todd Lipcon added a comment -

        Sanjay makes a good point above about this being less critical since HDFS-2791 was addressed. But there are still some test cases that come with this patch that fail without the bug fix. Let me write up a more thorough explanation for Sanjay of why I think this should still get done before I commit it.

        Todd Lipcon added a comment -

        Here's an explanation of why this is still important after HDFS-2791:

        The crux of the issue is something I mentioned in an earlier comment:

        If instead we try to process the queued messages as soon as we first hear about the block, we have the opposite problem – step 3 and step 4 are switched. This causes problems for Safe Mode since it isn't properly accounting the number of complete blocks in that case. Hence the patch currently attached to this JIRA breaks TestHASafeMode.

        I've added two new unit tests which exercise this. In the first test, the sequence is the following:
        State: SBN is in safe mode
        1. Active NN opens some blocks for construction, writes OP_ADD to add them
        2. Active NN rolls. SBN picks up the OP_ADD from the shared edits
        3. Active NN closes the file, but doesn't roll edits.
        4. DNs report FINALIZED replicas to the SBN.
        Here the SBN does not increment safeBlockCount, since the block is UNDER_CONSTRUCTION.
        5. Active NN rolls. SBN replays OP_CLOSE. This increments the total block count, since we have one more finalized block. But it has to incrementally track that it has one more safe block as well – since it already had a finalized replica before it played OP_CLOSE. This is one of the fixes provided by HDFS-2742.

        The second test is the following:
        State: SBN is in safe mode
        1. Active NN and SBN both see some files, and both have received block reports.
        2. Active NN deletes the files, and writes OP_DELETE to the log. It does not yet send the deletion requests to the DNs.
        3. Active NN rolls the log. SBN picks up the OP_DELETE.
        4. SBN calls setBlockTotal, which sees that we now have fewer blocks in the namespace. This decreases the safemode "total" count. However, those blocks were also included in the "safe" count. So without this patch, the "safe" count ends up higher than the "total" and we crash with an assertion failure.

        The other benefit of this patch is that the fine-grained tracking of queued block messages is a lot more efficient. In particular:

        • the SBN will get "stuck" in safemode less often, since when it starts up, it can process the majority of the initial block reports, rather than queueing them all.
        • the SBN's memory won't "blow out" nearly as fast if it falls behind reading the edit logs, since only the newest modified blocks will have to get queued.
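        A minimal numeric sketch of the invariant behind the second scenario above (illustrative counters only; the method name echoes adjustBlockTotals from the discussion, but the code is not from the patch): when a replayed OP_DELETE lowers the total, the already-safe blocks it removed must be subtracted from the safe count too, or safe exceeds total.

          // Illustrative safemode counters; not the patch's SafeModeInfo code.
          class SafeModeCountersSketch {
            private int blockTotal;
            private int blockSafe;

            void adjustBlockTotals(int deltaSafe, int deltaTotal) {
              blockSafe += deltaSafe;
              blockTotal += deltaTotal;
              assert blockSafe <= blockTotal
                  : "safe count " + blockSafe + " exceeds total " + blockTotal;
            }

            public static void main(String[] args) {
              SafeModeCountersSketch sm = new SafeModeCountersSketch();
              sm.adjustBlockTotals(5, 5);   // five blocks, all reported safe
              // A replayed OP_DELETE removes two blocks already counted safe;
              // both counters must move together to keep the invariant.
              sm.adjustBlockTotals(-2, -2);
            }
          }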
        Todd Lipcon added a comment -

        Attached patch adds the two extra tests described above, and addresses the dup import. Also rebased on tip of branch.

        Todd Lipcon added a comment -

        Previous patch accidentally dropped one of the changes from Eli's earlier review pass (only enable incremental block tracking for the HA NN). Re-incorporating it.

        Eli Collins added a comment -

        Todd, thanks for the detailed explanation! +1 to the latest patch.

        Eli Collins added a comment -

        I've committed this.

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-HAbranch-build #65 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/65/)
        HDFS-2742. HA: observed dataloss in replication stress test. Contributed by Todd Lipcon

        eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1238940
        Files :

        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfo.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingDataNodeMessages.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/Namesystem.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/PendingDataNodeMessages.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestPendingDataNodeMessages.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestHASafeMode.java

          People

          • Assignee: Todd Lipcon
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 7
