Hadoop HDFS
HDFS-1260

0.20: Block lost when multiple DNs trying to recover it to different genstamps

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20-append
    • Fix Version/s: 0.20.205.0
    • Component/s: None
    • Labels:
      None

      Description

      Saw this issue on a cluster where ops were making network changes without shutting down the DNs first. As a result, recovery was started on multiple DNs at the same time, and a race condition left a block permanently stuck in recovery. What seems to have happened is the following:

      • FSDataset.tryUpdateBlock was called with old genstamp 7091 and new genstamp 7094, while the block in the volumeMap (and on the filesystem) had genstamp 7093
      • we find the block file and meta file by block ID only, without comparing the genstamp
      • we rename the meta file to the new genstamp, _7094
      • in updateBlockMap, we look up the old block in the volumeMap without a wildcard GS, so the lookup misses and the volumeMap is not updated
      • validateBlockMetadata now fails with "blk_7739687463244048122_7094 does not exist in blocks map"

      After this point, all future recovery attempts on that node fail in getBlockMetaDataInfo: getStoredBlock finds the _7094 genstamp (because the meta file was renamed above), and validateBlockMetadata then fails because _7094 isn't in the volumeMap.
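
      The sequence above can be reproduced in a toy model. This is only a sketch: the class GenStampRaceSketch, its methods, and the /data/current location are hypothetical and do not come from FSDataset. It assumes the on-disk state can be reduced to a single genstamp and the volumeMap to a map keyed by block ID plus genstamp. The guarded variant shows one possible way to reject a stale caller and is not necessarily what the attached patches do.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one DataNode's view of a single block; not the real FSDataset. */
public class GenStampRaceSketch {
    private final long blockId;
    private long genStampOnDisk; // stands in for the suffix of the meta file name
    private final Map<String, String> volumeMap = new HashMap<>(); // "blk_id_gs" -> location

    public GenStampRaceSketch(long blockId, long genStamp) {
        this.blockId = blockId;
        this.genStampOnDisk = genStamp;
        volumeMap.put(key(blockId, genStamp), "/data/current"); // hypothetical location
    }

    private static String key(long id, long gs) {
        return "blk_" + id + "_" + gs;
    }

    /** Mirrors the reported sequence: files found by ID only, map updated by exact (ID, GS). */
    public void updateBlockBuggy(long oldGs, long newGs) {
        // The meta file is located by block ID alone and renamed to the new genstamp.
        genStampOnDisk = newGs;
        // Exact-match lookup on the caller's old genstamp; a miss leaves the map untouched.
        String location = volumeMap.remove(key(blockId, oldGs));
        if (location != null) {
            volumeMap.put(key(blockId, newGs), location);
        }
    }

    /** One possible guard (an assumption): refuse callers whose old genstamp is stale. */
    public boolean updateBlockGuarded(long oldGs, long newGs) {
        if (oldGs != genStampOnDisk || !volumeMap.containsKey(key(blockId, oldGs))) {
            return false; // nothing on disk or in the map has been touched
        }
        updateBlockBuggy(oldGs, newGs);
        return true;
    }

    /** Analogue of the validation step: the genstamp on disk must be present in the map. */
    public boolean isConsistent() {
        return volumeMap.containsKey(key(blockId, genStampOnDisk));
    }

    public static void main(String[] args) {
        GenStampRaceSketch buggy = new GenStampRaceSketch(7739687463244048122L, 7093L);
        buggy.updateBlockBuggy(7091L, 7094L); // stale caller: thinks the block is at 7091
        System.out.println("consistent after buggy update: " + buggy.isConsistent());

        GenStampRaceSketch guarded = new GenStampRaceSketch(7739687463244048122L, 7093L);
        boolean accepted = guarded.updateBlockGuarded(7091L, 7094L);
        System.out.println("stale update accepted: " + accepted
                + ", consistent: " + guarded.isConsistent());
    }
}
```

      In the buggy path the disk ends up at _7094 while the map still holds _7093, which is the state in which every later validation fails. In the guarded path the stale request is refused before anything is renamed, so disk and map stay in agreement.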

      Making a unit test for this is probably going to be difficult, but doable.

      1. HDFS-1260-20S.3.patch
        7 kB
        Jitendra Nath Pandey
      2. simultaneous-recoveries.txt
        465 kB
        Todd Lipcon
      3. hdfs-1260.txt
        7 kB
        Todd Lipcon
      4. hdfs-1260.txt
        6 kB
        Todd Lipcon

        Issue Links

          Activity

          Todd Lipcon created issue -
          Todd Lipcon made changes -
          Field Original Value New Value
          Link This issue is related to HDFS-1231 [ HDFS-1231 ]
          Todd Lipcon made changes -
          Attachment hdfs-1260.txt [ 12447910 ]
          Todd Lipcon made changes -
          Attachment hdfs-1260.txt [ 12447915 ]
          dhruba borthakur made changes -
          Link This issue is related to HDFS-1263 [ HDFS-1263 ]
          Todd Lipcon made changes -
          Attachment simultaneous-recoveries.txt [ 12453075 ]
          Jitendra Nath Pandey made changes -
          Attachment HDFS-1260-20S.3.patch [ 12493885 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 0.20.205.0 [ 12316392 ]
          Fix Version/s 0.20-append [ 12315103 ]
          Resolution Fixed [ 1 ]
          Matt Foley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee: Todd Lipcon
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 6

              Dates

              • Created:
                Updated:
                Resolved:
