Hadoop Common / HADOOP-4742

Healthy replica mistakenly deleted in Hadoop 0.18.1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.1
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels:
      None
    • Environment:

      CentOS 5.2, JDK 1.6,
      16 DataNodes and 1 NameNode, each with 8 GB of memory and a 4-core CPU, connected by Gigabit Ethernet

    • Hadoop Flags:
      Reviewed

      Description

      We recently deployed a 0.18.1 cluster and ran some tests. We found that
      if we corrupt a block, the namenode detects it and re-replicates the block
      as soon as a client reads it. However, the namenode also deletes a healthy
      replica (the source of that replication operation) at the same time.
      (I think this issue may affect the whole 0.18 tree.)

      After some tracing, I found that FSNamesystem.addStoredBlock() checks
      the number of replicas after adding the block to blocksMap:

      NumberReplicas num = countNodes(storedBlock);
      int numLiveReplicas = num.liveReplicas();
      int numCurrentReplica = numLiveReplicas
      + pendingReplications.getNumReplicas(block);

      This means that all live replicas and all pending replications are
      counted. But at the end of FSNamesystem.blockReceived(), which is the
      caller, addStoredBlock() runs first and only afterwards is the
      pendingReplications count reduced:

      //
      // Modify the blocks->datanode map and node's map.
      //
      addStoredBlock(block, node, delHintNode );
      pendingReplications.remove(block);

      Hence the newly received replica is counted twice, once as a live
      replica and once as a pending replication, so a replica is marked as
      excess and mistakenly deleted.

      I think swapping the order of these two calls in blockReceived() may
      solve this issue:

      --- FSNamesystem.java-orig 2008-11-28 13:34:40.000000000 +0800
      +++ FSNamesystem.java 2008-11-28 13:54:12.000000000 +0800
      @@ -3152,8 +3152,8 @@
           //
           // Modify the blocks->datanode map and node's map.
           //
      -    addStoredBlock(block, node, delHintNode );
           pendingReplications.remove(block);
      +    addStoredBlock(block, node, delHintNode );
         }

         long[] getStats() throws IOException {
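
A minimal sketch of the double count described above, not the actual FSNamesystem code. The class ReplicaCountSketch, its two maps, and the node names dn1 to dn3 are hypothetical stand-ins for blocksMap and pendingReplications; the starting state (two live replicas, one pending replication) is an assumption chosen so that the buggy order yields the "current replicas 4" figure reported in the log.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReplicaCountSketch {
    // Hypothetical stand-ins for blocksMap and pendingReplications.
    private final Map<String, Set<String>> liveReplicas = new HashMap<>();
    private final Map<String, Integer> pending = new HashMap<>();

    /** Records the replica, then returns live + pending, mirroring the count in addStoredBlock(). */
    int addStoredBlock(String block, String node) {
        liveReplicas.computeIfAbsent(block, b -> new HashSet<>()).add(node);
        return liveReplicas.get(block).size() + pending.getOrDefault(block, 0);
    }

    /** Decrements the pending count for the block, dropping the entry when it reaches zero. */
    void removePending(String block) {
        pending.computeIfPresent(block, (b, n) -> n > 1 ? Integer.valueOf(n - 1) : null);
    }

    /** Assumed starting state: two healthy replicas and one replication in flight. */
    static ReplicaCountSketch twoLiveOnePending() {
        ReplicaCountSketch s = new ReplicaCountSketch();
        s.addStoredBlock("blk", "dn1");
        s.addStoredBlock("blk", "dn2");
        s.pending.put("blk", 1);
        return s;
    }

    /** Original order: count first, clear the pending entry afterwards. */
    static int buggyOrder() {
        ReplicaCountSketch s = twoLiveOnePending();
        int counted = s.addStoredBlock("blk", "dn3"); // new replica is live AND still pending
        s.removePending("blk");
        return counted;
    }

    /** Proposed order: clear the pending entry first, then count. */
    static int fixedOrder() {
        ReplicaCountSketch s = twoLiveOnePending();
        s.removePending("blk");
        return s.addStoredBlock("blk", "dn3");
    }

    public static void main(String[] args) {
        System.out.println("buggy order counts " + buggyOrder() + " replicas");
        System.out.println("fixed order counts " + fixedOrder() + " replicas");
    }
}
```

With a target replication of 3 (an assumption; the report does not state it), the buggy order sees 4 replicas and would treat one as excess, while the reordered calls see 3 and delete nothing.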

      The following are the logs for the mistaken deletion, with additional
      logging inserted by me.

      2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: DIR
      NameNode.reportBadBlocks
      2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: BLOCK
      NameSystem.addToCorruptReplicasMap: blk_3828935579548953768 added as
      corrupt on 192.168.33.51:50010 by /192.168.33.51
      2008-11-28 11:22:10,179 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      ask 192.168.33.50:50010 to replicate blk_3828935579548953768_1184 to
      datanode(s) 192.168.33.45:50010
      2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      NameSystem.addStoredBlock: blockMap updated: 192.168.33.45:50010 is
      added to blk_3828935579548953768_1184 size 67108864
      2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: Wang
      Xu* NameSystem.addStoredBlock: current replicas 4 in which has 1
      pendings
      2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: DIR*
      NameSystem.invalidateBlock: blk_3828935579548953768_1184 on
      192.168.33.51:50010
      2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      NameSystem.delete: blk_3828935579548953768 is added to invalidSet of
      192.168.33.51:50010
      2008-11-28 11:22:13,180 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      ask 192.168.33.44:50010 to delete blk_3828935579548953768_1184
      2008-11-28 11:22:13,181 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      ask 192.168.33.51:50010 to delete blk_3828935579548953768_1184

      1. HADOOP-4742.diff
        0.6 kB
        Wang Xu
      2. blockReceived-br18.patch
        0.5 kB
        Hairong Kuang
      3. blockReceived.patch
        0.6 kB
        Hairong Kuang


          People

          • Assignee:
            Wang Xu
            Reporter:
            Wang Xu
