Hadoop Common / HADOOP-4742

Healthy replica mistakenly deleted in Hadoop 0.18.1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.1
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels:
      None
    • Environment:

      CentOS 5.2, JDK 1.6,
      16 datanodes and 1 namenode, each with 8 GB of memory and a 4-core CPU, connected by Gigabit Ethernet

    • Hadoop Flags:
      Reviewed

      Description

      We recently deployed a 0.18.1 cluster and ran some tests. We found that
      if we corrupt a block, the namenode detects it and re-replicates it as soon as
      a client reads that block. However, the namenode also deletes a healthy replica
      (the source of the above replication operation) at the same time. (I think this
      issue may affect the whole 0.18 tree.)

      After some tracing, I found that FSNamesystem.addStoredBlock() checks
      the number of replicas after adding the block to blocksMap:

      NumberReplicas num = countNodes(storedBlock);
      int numLiveReplicas = num.liveReplicas();
      int numCurrentReplica = numLiveReplicas
      + pendingReplications.getNumReplicas(block);

      That is, all the live replicas and pending replications are
      counted. However, FSNamesystem.blockReceived() calls addStoredBlock()
      first and only afterwards reduces the pendingReplications count:

      //
      // Modify the blocks->datanode map and node's map.
      //
      addStoredBlock(block, node, delHintNode );
      pendingReplications.remove(block);

      Hence, the newly replicated replica is counted twice, once as a live
      replica and once as a pending replication, so one replica is marked as
      excess, which leads to a mistaken deletion.

      I think swapping the order of these two calls in blockReceived() may
      solve this issue:

      --- FSNamesystem.java-orig 2008-11-28 13:34:40.000000000 +0800
      +++ FSNamesystem.java 2008-11-28 13:54:12.000000000 +0800
      @@ -3152,8 +3152,8 @@
           //
           // Modify the blocks->datanode map and node's map.
           //
      -    addStoredBlock(block, node, delHintNode );
           pendingReplications.remove(block);
      +    addStoredBlock(block, node, delHintNode );
         }

         long[] getStats() throws IOException {
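
      The effect of the call order can be illustrated with a small toy model.
      This is only a sketch, not the real FSNamesystem code: the class
      ReplicaCountSketch, its maps, and its method bodies are hypothetical
      stand-ins, and the target replication factor of 3 is an assumption
      (the report does not state it). It starts from two healthy live
      replicas plus one pending replication, then processes the arrival of
      the new replica in both orders.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the replica accounting; NOT the real FSNamesystem code. */
public class ReplicaCountSketch {
    // Hypothetical stand-ins for blocksMap and pendingReplications.
    private final Map<String, Integer> liveReplicas = new HashMap<>();
    private final Map<String, Integer> pending = new HashMap<>();
    private final int targetReplication;

    ReplicaCountSketch(int targetReplication) {
        this.targetReplication = targetReplication;
    }

    /** Assumed scenario: replication factor 3, two healthy replicas, one copy in flight. */
    static ReplicaCountSketch scenario() {
        ReplicaCountSketch s = new ReplicaCountSketch(3);
        s.liveReplicas.put("blk", 2);
        s.pending.put("blk", 1);
        return s;
    }

    /** Records the new replica, then counts live + pending as the description says. */
    int addStoredBlock(String block) {
        int live = liveReplicas.merge(block, 1, Integer::sum);
        int current = live + pending.getOrDefault(block, 0);
        return Math.max(0, current - targetReplication); // replicas considered excess
    }

    /** Decrements the pending count for the block, dropping the entry at zero. */
    void removePending(String block) {
        pending.computeIfPresent(block, (k, v) -> v > 1 ? v - 1 : null);
    }

    /** Order before the patch: count first, then clear the pending entry. */
    int blockReceivedOriginal(String block) {
        int excess = addStoredBlock(block);
        removePending(block);
        return excess;
    }

    /** Order after the patch: clear the pending entry, then count. */
    int blockReceivedPatched(String block) {
        removePending(block);
        return addStoredBlock(block);
    }

    public static void main(String[] args) {
        // 3 live + 1 pending = 4 > 3, so one replica looks excess.
        System.out.println("original order, excess = " + scenario().blockReceivedOriginal("blk"));
        // 3 live + 0 pending = 3, so nothing is excess.
        System.out.println("patched order, excess = " + scenario().blockReceivedPatched("blk"));
    }
}
```

      Under these assumptions the original order reports one excess replica
      and the patched order reports none, which is consistent with the
      "current replicas 4 in which has 1 pendings" line in the logs below.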

      The following are the logs for the mistaken deletion, with additional
      logging that I inserted.

      2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: DIR
      NameNode.reportBadBlocks
      2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: BLOCK
      NameSystem.addToCorruptReplicasMap: blk_3828935579548953768 added as
      corrupt on 192.168.33.51:50010 by /192.168.33.51
      2008-11-28 11:22:10,179 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      ask 192.168.33.50:50010 to replicate blk_3828935579548953768_1184 to
      datanode(s) 192.168.33.45:50010
      2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      NameSystem.addStoredBlock: blockMap updated: 192.168.33.45:50010 is
      added to blk_3828935579548953768_1184 size 67108864
      2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: Wang
      Xu* NameSystem.addStoredBlock: current replicas 4 in which has 1
      pendings
      2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: DIR*
      NameSystem.invalidateBlock: blk_3828935579548953768_1184 on
      192.168.33.51:50010
      2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      NameSystem.delete: blk_3828935579548953768 is added to invalidSet of
      192.168.33.51:50010
      2008-11-28 11:22:13,180 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      ask 192.168.33.44:50010 to delete blk_3828935579548953768_1184
      2008-11-28 11:22:13,181 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
      ask 192.168.33.51:50010 to delete blk_3828935579548953768_1184

      1. HADOOP-4742.diff
        0.6 kB
        Wang Xu
      2. blockReceived-br18.patch
        0.5 kB
        Hairong Kuang
      3. blockReceived.patch
        0.6 kB
        Hairong Kuang

        Activity

        Owen O'Malley made changes -
        Component/s dfs [ 12310710 ]
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hudson added a comment -

        Integrated in Hadoop-trunk #680 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/ )

        Wang Xu added a comment -

        Thanks, Hairong! I learned a lot about Hadoop's issue process here, and I think I will do more next time.

        Hairong Kuang made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Hairong Kuang added a comment -

        I've just committed this. Thank you, Wang!

        Hairong Kuang added a comment -

        ant test-core succeeded:
        BUILD SUCCESSFUL
        Total time: 115 minutes 14 seconds

        ant test-patch result:
        [exec] -1 overall.

        [exec] +1 @author. The patch does not contain any @author tags.

        [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
        [exec] Please justify why no tests are needed for this patch.

        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

        [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        Hairong Kuang made changes -
        Attachment blockReceived.patch [ 12395235 ]
        Hairong Kuang made changes -
        Attachment blockReceived.patch [ 12395233 ]
        Hairong Kuang made changes -
        Attachment blockReceived-br18.patch [ 12395234 ]
        Hairong Kuang added a comment -

        This is the patch for branch 0.18.

        Hairong Kuang made changes -
        Attachment blockReceived.patch [ 12395233 ]
        Hairong Kuang added a comment -

        Thanks, Wang, for your contribution. I redid the patch against the trunk.

        Wang Xu made changes -
        Attachment HADOOP-4742.diff [ 12395068 ]
        Wang Xu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Doug Cutting made changes -
        Assignee Hairong Kuang [ hairong ] Wang Xu [ gnawux ]
        Hairong Kuang added a comment -

        Yes, I think this is indeed a problem. The proposed solution should be able to fix the problem.

        Nigel Daley made changes -
        Field Original Value New Value
        Priority Major [ 3 ] Blocker [ 1 ]
        Assignee Hairong Kuang [ hairong ]
        Fix Version/s 0.18.3 [ 12313494 ]
        Wang Xu created issue -

          People

          • Assignee:
            Wang Xu
          • Reporter:
            Wang Xu
          • Votes:
            1
          • Watchers:
            1
