HDFS-1260

0.20: Block lost when multiple DNs trying to recover it to different genstamps

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20-append
    • Fix Version/s: 0.20.205.0
    • Component/s: None
    • Labels: None

      Description

      Saw this issue on a cluster where some ops people were doing network changes without shutting down DNs first. So, recovery ended up getting started at multiple different DNs at the same time, and some race condition occurred that caused a block to get permanently stuck in recovery mode. What seems to have happened is the following:

      • FSDataset.tryUpdateBlock called with old genstamp 7091, new genstamp 7094, while the block in the volumeMap (and on filesystem) was genstamp 7093
      • we find the block file and meta file based on block ID only, without comparing gen stamp
      • we rename the meta file to the new genstamp _7094
      • in updateBlockMap, we do comparison in the volumeMap by oldblock without wildcard GS, so it does not update volumeMap
      • validateBlockMetadata now fails with "blk_7739687463244048122_7094 does not exist in blocks map"

      After this point, all future recovery attempts on that node fail in getBlockMetaDataInfo: getStoredBlock finds the _7094 genstamp (since the meta file was renamed above), and validateBlockMetadata then fails because _7094 isn't in the volumeMap.

      Making a unit test for this is probably going to be difficult, but doable.
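      To make the sequence concrete, here is a toy model of it. This sketch is illustrative only: the names are invented, and a plain map stands in for the volumeMap and the on-disk meta file name.

          import java.util.HashMap;
          import java.util.Map;

          /**
           * Toy reproduction of the ordering bug described above.
           * Illustrative only; this is not the actual FSDataset code.
           */
          public class TryUpdateBlockDemo {
            public static void main(String[] args) {
              // volumeMap keyed by "blockId_genstamp", as the lookups above effectively are
              Map<String, Boolean> volumeMap = new HashMap<>();
              volumeMap.put("blk_7739687463244048122_7093", true);

              // Stand-in for the genstamp encoded in the on-disk meta file name.
              String metaFileGS = "7093";

              long oldGS = 7091, newGS = 7094;  // recovery arrives holding a stale oldGS

              // 1+2. The meta file is found by block ID only (so the stale oldGS is
              //      never noticed) and renamed to the new genstamp BEFORE validation.
              metaFileGS = Long.toString(newGS);

              // 3. updateBlockMap keys on the exact old genstamp (no wildcard),
              //    misses, and silently leaves the map entry at _7093.
              volumeMap.remove("blk_7739687463244048122_" + oldGS);  // no-op

              // 4. Validation: disk says _7094, map says _7093 -> stuck forever.
              boolean consistent =
                  volumeMap.containsKey("blk_7739687463244048122_" + metaFileGS);
              System.out.println("consistent after update? " + consistent);  // false
            }
          }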

      Attachments

      1. hdfs-1260.txt (6 kB, Todd Lipcon)
      2. hdfs-1260.txt (7 kB, Todd Lipcon)
      3. simultaneous-recoveries.txt (465 kB, Todd Lipcon)
      4. HDFS-1260-20S.3.patch (7 kB, Jitendra Nath Pandey)


          Activity

          Todd Lipcon added a comment -

          To confirm the suspicion above, I had the operator rename the meta block back to _7094, and the next recovery attempt succeeded.

          Todd Lipcon added a comment -

          err... sorry... rename the meta block back to _7093

          Todd Lipcon added a comment -

          Simple patch which reorders tryUpdateBlock's verification vs rename.
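          For illustration, a toy sketch of that reordering, continuing the simplified model from the description (invented names, not the actual patch). The point is that validation against the block map happens before any on-disk rename, so a stale genstamp is rejected while disk and map still agree:

              import java.util.HashMap;
              import java.util.Map;

              /** Illustrative only: verify first, rename second. */
              public class ReorderedUpdateDemo {
                public static void main(String[] args) {
                  Map<String, Boolean> volumeMap = new HashMap<>();
                  volumeMap.put("blk_123_7093", true);

                  long oldGS = 7091;  // stale genstamp held by this recovery attempt

                  // Verify FIRST: if the caller's view of the block doesn't match
                  // the map, bail out before touching anything on disk.
                  if (!volumeMap.containsKey("blk_123_" + oldGS)) {
                    System.out.println("stale recovery (GS " + oldGS
                        + ") rejected; meta file left untouched");
                    return;  // disk and map remain consistent at _7093
                  }
                  // Only now would the meta file be renamed and the map updated.
                }
              }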

          sam rash added a comment -

          about the testing, any reason not to use one of the adapters instead of making this public?

          public long nextGenerationStampForBlock(Block block) throws IOException {

          sorry, i'm a stickler for visibility/encapsulation bits when i can be
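          For reference, a sketch of the adapter approach being suggested here. The class name and placement are hypothetical, not actual Hadoop code, and it assumes (per the quoted signature) that the method lives on FSNamesystem. The pattern: a small test-only class in the production package forwards to the package-private method, so the production visibility never has to widen.

              package org.apache.hadoop.hdfs.server.namenode;

              import java.io.IOException;

              import org.apache.hadoop.hdfs.protocol.Block;

              /**
               * Hypothetical test adapter: lives in the production package, so it
               * can reach package-private members on behalf of tests.
               */
              public class FSNamesystemTestAdapter {
                public static long nextGenerationStampForBlock(FSNamesystem ns, Block block)
                    throws IOException {
                  // The production method can stay package-private.
                  return ns.nextGenerationStampForBlock(block);
                }
              }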

          Nicolas Spiegelberg added a comment -

          +1 on the fix. good catch

          sam rash added a comment -

          oh, other than that, lgtm

          Todd Lipcon added a comment -

          Here's a patch that moves the call over to the adapter. Also added a bit of javadoc to DelayAnswer.
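          For readers who haven't seen it, a minimal sketch of a DelayAnswer-style helper. This is illustrative, not the actual class from the HDFS test code: a Mockito Answer that parks the stubbed call until the test releases it, which is what makes race interleavings like the one in this bug reproducible.

              import java.util.concurrent.CountDownLatch;

              import org.mockito.invocation.InvocationOnMock;
              import org.mockito.stubbing.Answer;

              /** Illustrative DelayAnswer-style helper for forcing interleavings. */
              class DelayAnswer implements Answer<Object> {
                private final CountDownLatch fireLatch = new CountDownLatch(1);  // call arrived
                private final CountDownLatch waitLatch = new CountDownLatch(1);  // go-signal

                @Override
                public Object answer(InvocationOnMock invocation) throws Throwable {
                  fireLatch.countDown();   // tell the test the call has started
                  waitLatch.await();       // park until the test says proceed
                  return invocation.callRealMethod();
                }

                void waitForCall() throws InterruptedException { fireLatch.await(); }
                void proceed() { waitLatch.countDown(); }
              }

          A test would stub the target method on a spied object with Mockito's doAnswer(...), call waitForCall() to line both racers up, then proceed() to release them in the order under test.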

          sam rash added a comment -

          yea, looks good. at some point, does it make sense to move the DelayAnswer class out? it seems generally useful (not this patch, but just thinking)

          Todd Lipcon added a comment -

          Yea, we could move it to a MockitoUtil class or something. Let's tackle that when we move all these tests forward to trunk (I plan to do that in July hopefully)

          dhruba borthakur added a comment -

          This patch looks perfect to me. +1

          Todd Lipcon added a comment -

          After months of running this test I ran into this failure attached above. One of the DNs somehow ends up with multiple meta files for the same block, but at different generation stamps.

          I think the issue is in the implementation of DataNode.updateBlock(). The block passed in doesn't have a wildcard generation stamp, but we don't care - we go and find the block on disk without matching generation stamps. I think this is OK based on the validation logic - we still only move blocks forward in GS-time, and don't revert length. However, when we then call updateBlockMap() it doesn't use a wildcard generation stamp, so the block can get left in the block map with the old generation stamp. This inconsistency I think cascades into the sort of failure seen in the attached log.

          I think the solution is:

          • Change updateBlock to call updateBlockMap with a wildcard generation stamp key
          • Change the interruption code to use a wildcard GS block when interrupting concurrent writers

          I will make these changes and see if the rest of the unit tests still pass, then see if I can come up with a regression test.
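          A toy sketch of the wildcard-genstamp fix described above (invented names, same simplified model as in the description). Keying the map update on the block ID alone means a stale genstamp on the incoming block can no longer strand the entry, while the forward-only check still prevents reverting:

              import java.util.HashMap;
              import java.util.Map;

              /** Illustrative only: wildcard-GS update keyed on block ID alone. */
              public class WildcardUpdateDemo {
                public static void main(String[] args) {
                  Map<Long, Long> volumeMap = new HashMap<>();  // blockId -> genstamp
                  long blockId = 7739687463244048122L;
                  volumeMap.put(blockId, 7093L);

                  long oldGS = 7091, newGS = 7094;  // recovery holds a stale oldGS

                  // Wildcard lookup: match on block ID only, ignore the stale oldGS,
                  // and only ever move the genstamp forward.
                  Long current = volumeMap.get(blockId);
                  if (current != null && current < newGS) {
                    volumeMap.put(blockId, newGS);
                  }
                  System.out.println("map genstamp now: " + volumeMap.get(blockId));  // 7094
                }
              }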

          dhruba borthakur added a comment -

          > However, when we then call updateBlockMap() it doesn't use a wildcard generation stamp,

          yes, that seems to be a bug to me. Great catch!

          Jitendra Nath Pandey added a comment -

          Patch ported to 20-security.

          Suresh Srinivas added a comment -

          +1 for the patch.

          Todd Lipcon added a comment -

          This was committed to 0.20.205, resolving this JIRA.

          Matt Foley added a comment -

          Todd, do we need this in trunk also?

          Todd Lipcon added a comment -

          I don't believe so - I think the new append design in trunk prevents this issue. There is an existing open JIRA against trunk about forward-porting all append-related test cases to the new trunk implementation to be sure the new design doesn't suffer from the same issues.

          Matt Foley added a comment -

          Closed upon release of 0.20.205.0


            People

            • Assignee: Todd Lipcon
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 6
