Hadoop HDFS: HDFS-3368

Missing blocks due to bad DataNodes coming up and down.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0, 1.0.0, 3.0.0
    • Fix Version/s: 0.22.1, 2.0.2-alpha
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      All replicas of a block can be removed if bad DataNodes come up and down during a cluster restart, resulting in data loss.

      Attachments

      1. blockDeletePolicy.patch
        3 kB
        Konstantin Shvachko
      2. blockDeletePolicy-0.22.patch
        7 kB
        Konstantin Shvachko
      3. blockDeletePolicy-trunk.patch
        8 kB
        Konstantin Shvachko
      4. blockDeletePolicy-0.22.patch
        9 kB
        Konstantin Shvachko
      5. blockDeletePolicy-trunk.patch
        10 kB
        Konstantin Shvachko

        Issue Links

          Activity

          Eli Collins added a comment -

          I mean HDFS-3690.

          Eli Collins added a comment -

          This change broke the common tests, e.g. TestCopyListing; filed HDFS-3197.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-22-branch #130 (See https://builds.apache.org/job/Hadoop-Hdfs-22-branch/130/)
          HDFS-3368. Missing blocks due to bad DataNodes coming up and down. Contributed by Konstantin Shvachko. (Revision 1342521)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1342521
          Files :

          • /hadoop/common/branches/branch-0.22/hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.22/hdfs/src/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/branches/branch-0.22/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyDefault.java
          • /hadoop/common/branches/branch-0.22/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestOverReplicatedBlocks.java
          Konstantin Shvachko added a comment -

          Resolving. Thanks Nicholas for review and help.

          Tsz Wo Nicholas Sze added a comment -

          > Merged this to 2.0.0.

          Oops, it is branch-2 not 2.0.0.

          Tsz Wo Nicholas Sze added a comment -

          Merged this to 2.0.0.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #2307 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2307/)
          HDFS-3368. Missing blocks due to bad DataNodes coming up and down. (Revision 1342512)

          Result = FAILURE
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1342512
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestOverReplicatedBlocks.java
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #2288 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2288/)
          HDFS-3368. Missing blocks due to bad DataNodes coming up and down. (Revision 1342512)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1342512
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestOverReplicatedBlocks.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #2361 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2361/)
          HDFS-3368. Missing blocks due to bad DataNodes coming up and down. (Revision 1342512)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1342512
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestOverReplicatedBlocks.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1090 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1090/)
          HDFS-3368. Missing blocks due to bad DataNodes coming up and down. (Revision 1342512)

          Result = ABORTED
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1342512
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestOverReplicatedBlocks.java
          Harsh J added a comment -

          (Updating Fix Version Fields for current state)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1056 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1056/)
          HDFS-3368. Missing blocks due to bad DataNodes coming up and down. (Revision 1342512)

          Result = SUCCESS
          shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1342512
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestOverReplicatedBlocks.java
          Konstantin Shvachko added a comment -

          I committed to trunk and branch-0.22.
          Could you guys commit to 2.0? My old box has space only for two versions. I think the trunk patch will not require any changes.

          Tsz Wo Nicholas Sze added a comment -

          +1 good work! Thanks, Konstantin.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12528559/blockDeletePolicy-trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2516//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2516//console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          No, all failures are unrelated to the patch.
          I looked through the Jenkins logs.

          1. org.apache.hadoop.hdfs.TestDFSClientRetries.testGetFileChecksum
            This one fails because the previous test sets the xceiver count in the config to 2 and never resets it. So the creation of a large file in testGetFileChecksum eventually fails, because the DNs refuse to add more xceiver threads.
            java.io.IOException: Xceiver count 3 exceeds the limit of concurrent xcievers: 2
            	at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:143)
            	at java.lang.Thread.run(Thread.java:662)
            
          2. org.apache.hadoop.hdfs.TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy1
            Fails because DFSTestUtil.waitCorruptReplicas() is timing/delay sensitive.
            It reads some file 50 times and checks if the corruption is detected after each read.
            That time was enough for the DN to restart, but not enough for NN to detect the corruption.
            Looking for "NameSystem.addToCorruptReplicasMap:" and it is not in the logs.
            By the way testBlockCorruptionRecoveryPolicy2 which corrupts 2 replicas onstead of one worked fine.
          3. org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks.testCorruptBlockRereplicatedAcrossRacks
            Fails for the same reason. I see fifty "Waiting for 1 corrupt replicas" messages, which means 50 reads have been done, but no "addToCorruptReplicasMap", indicating that the corruption was not detected.

          I can file jiras for that.

          Resubmitted the build in case I missed something.

          Tsz Wo Nicholas Sze added a comment -

          Patch looks good but the failed tests seem related.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12528559/blockDeletePolicy-trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
          org.apache.hadoop.hdfs.TestDFSClientRetries
          org.apache.hadoop.hdfs.TestDatanodeBlockScanner

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2503//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2503//console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          Updated patches to include review comments and the configuration parameter for the multiplier.
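
          As a rough illustration of what such an undocumented multiplier parameter could look like, a minimal sketch follows; the key string, constant names, and default value are assumptions for this sketch and are not taken verbatim from the attached patches:

          import org.apache.hadoop.conf.Configuration;

          public class TolerateHeartbeatConfigSketch {
            // Hypothetical key and default; the committed patch may use different names.
            public static final String DFS_TOLERATE_HEARTBEAT_MULTIPLIER_KEY =
                "dfs.namenode.tolerate.heartbeat.multiplier";
            public static final int DFS_TOLERATE_HEARTBEAT_MULTIPLIER_DEFAULT = 4;

            private final int tolerateHeartbeatMultiplier;

            public TolerateHeartbeatConfigSketch(Configuration conf) {
              // Read the undocumented parameter, falling back to the experimentally chosen default.
              this.tolerateHeartbeatMultiplier = conf.getInt(
                  DFS_TOLERATE_HEARTBEAT_MULTIPLIER_KEY, DFS_TOLERATE_HEARTBEAT_MULTIPLIER_DEFAULT);
            }

            public int getTolerateHeartbeatMultiplier() {
              return tolerateHeartbeatMultiplier;
            }
          }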

          Aaron T. Myers added a comment -

          I think the problem is not in Hadoop having too many configuration options per se, but rather in having too many that one needs to change. The key difference being that we just need to give it a good default value. Making it undocumented is fine and perhaps even desirable, since you're right - most people will never need or want to change it. But, if someone (maybe me or you, one day) does find some need to change it, they'll be very happy they're able to do so without either recompiling or waiting for the next Hadoop release. I see no benefit to having it hard-coded as opposed to an undocumented config parameter, and some potential benefit to having it configurable.

          Konstantin Shvachko added a comment -

          I incorporated Nicholas's comments.
          Double checking on the configuration parameter before changing it.
          Aaron, it sounds like you would generally prefer configuration parameters to constants. But this is not always a good approach; there have been many discussions about it in the past. People get confused if you give them too many parameters that they have to understand in order to change. This one is hardly ever going to be changed by anybody.
          So I am asking whether you feel strongly about making it an undocumented config parameter.

          Aaron T. Myers added a comment -

          > I was thinking about making it configurable, but decided not to introduce yet another parameter. The value is chosen based on experimental data. If you feel strongly about making it a configuration key, I can add an undocumented parameter.

          IMO, we should add a configuration option for it. Even if it's unlikely to change, if someone does want to change it they'll thank us that they don't have to change the code/recompile to do so.

          Tsz Wo Nicholas Sze added a comment -

          I just found that the enableDebugLogging constant is outdated. It still refers to the "FSNamesystem logger", but the actual logger is BlockPlacementPolicy.LOG. Could you also update it? Below is my suggested change.

           public class BlockPlacementPolicyDefault extends BlockPlacementPolicy {
          +  private static final String enableDebugLogging
          +      = "For more information, please enable DEBUG log level on "
          +          + ((Log4JLogger)LOG).getLogger().getName();
          +
             private boolean considerLoad; 
             private boolean preferLocalNode = true;
             private NetworkTopology clusterMap;
             private FSClusterStats stats;
          -  static final String enableDebugLogging = "For more information, please enable"
          -    + " DEBUG level logging on the "
          -    + "org.apache.hadoop.hdfs.server.namenode.FSNamesystem logger.";
          
          Tsz Wo Nicholas Sze added a comment -

          Typo: like => likely

          Tsz Wo Nicholas Sze added a comment -

          If you think that it is not like to change the value of the multiplier, then we don't have to make it configurable.

          Konstantin Shvachko added a comment -

          Thanks for the review.
          Name change makes sense.
          I was thinking about making it configurable, but decided not to introduce yet another parameter. The value is chosen based on experimental data. If you feel strongly about making it a configuration key, I can add an undocumented parameter.

          Tsz Wo Nicholas Sze added a comment -

          If a datanode somehow has a long heartbeat interval, it does indicate that the datanode may have some problem. Also, the patch is quite simple. So, I am fine on changing the default replication policy. Some comments on the patch:

          • How about renaming DFS_TOLERATE_HEARTBEAT_MISSES to DFS_TOLERATE_HEARTBEAT_MULTIPLIER?
          • Since we are not sure what is a good choice of the multiplier, how about making it configurable?
          Konstantin Shvachko added a comment -

          This is pretty rare, but when you hit it, it takes a while to figure out what went wrong. If not fixed, the problem becomes a maintenance issue: ops will have to remember to add every failed node to the exclude list, which is sometimes not obvious and definitely time-consuming.

          Block allocation does not take heartbeats into account. As you know, there are other mechanisms there, like DN load. Even if a new replica is assigned to a node that has recently gone down, this will be detected during the data transfer and a new location will be assigned.
          I don't see how it should correlate with the delete policy.

          Suresh Srinivas added a comment - edited

          > Your question about complexity is similar to one of why do we bother introducing all the HA complexity if all NameNodes primary and standby can fail at once.

          I do not see the relationship of this to my comments.

          I am not saying this problem should not be solved. Given how rare it is, why change the default behavior? If others feel this makes sense, I am okay with it. Note that the problem is solved only for some cases, that is, excess replica deletion and not block allocation.

          Konstantin Shvachko added a comment -

          You are right that a failure of three random nodes leads to data loss. We know that and cannot do anything about it.

          The case is different here. The cluster has 6 replicas and ends up with 0 as the result of the current policy, which makes it almost inevitable. This can be avoided by a slight modification in the policy making it smarter about potentially flaky nodes.

          Your question about complexity is similar to one of why do we bother introducing all the HA complexity if all NameNodes primary and standby can fail at once.

          Suresh Srinivas added a comment -

          > They are not chosen for new blocks. This is a different scenario.

          I am saying if such flaky nodes are around, the problem I described can happen as well.

          My concern is: is it necessary to add this complexity to the default policy?

          Konstantin Shvachko added a comment -

          > d01, do2, do3 are chosen for adding new block.

          They are not chosen for new blocks. This is a different scenario.
          do[1-3] went down a long time ago (and all their blocks were replicated out to other nodes), but they were not put into the exclude list.
          On cluster restart, do[1-3] are brought up along with dn[1-3], so for a brief period the block has 6 replicas and 3 of them need to be deleted. Under the current default policy the replicas on dn[1-3] are chosen for deletion, because those nodes have less free space. do[1-3] are flaky and die shortly after sending their block reports on restart, so 10 minutes later all 6 replicas are gone.
          Just as I described in my first comment. The bug is in the default policy; I'm not defining a new one.

          Suresh Srinivas added a comment -

          Sorry, I should have added more details to my comment:
          In your description of the problem first the failure is one by one - "At different times all three nodes malfunctioned and died, causing the replicas to migrate to dn1, dn2, dn3." Later the failure is together in a short time "Expectedly do1, do2, do3 malfunction again and go down shortly after reporting their blocks to NN".

          While you change how you choose the replicas to delete, the presence of nodes like do1, do2 and do3 means that the following scenario is possible:

          • do1, do2, do3 are chosen for adding a new block.
          • client adds a block to these nodes.
          • shortly afterwards, do1, do2, do3 all go down.
            Now the replicas are no longer available.

          HDFS replication assumes that the probability of three nodes holding the same replicas going down together in a short time is low. Given that, I am not sure this problem is important enough.

          Alternatively, given that the block placement policy is pluggable, could you write a custom implementation and not change the default implementation?

          Suresh Srinivas added a comment -

          > Expectedly do1, do2, do3 malfunction again and go down shortly after reporting their blocks to NN.

          Konstantin, I am not sure how likely it is that do1, do2 and do3 all go down within 10 minutes. Is it because these are flaky nodes?

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12526469/blockDeletePolicy-trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2421//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2421//console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          Submitting patch for trunk.

          Konstantin Shvachko added a comment -

          I ended up using 4 as the multiplier for heartbeatInterval. I looked at my busy but healthy cluster; there are always some nodes with the last heartbeat around 10, so a multiplier of 4 should cover that.
          If there are nodes that are permanently late with heartbeats, then this policy will eventually reduce the block count on such nodes, which will reduce the load on them and potentially help with heartbeats.
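
          A minimal sketch of the staleness test this multiplier implies; the 3-second figure is the stock dfs.heartbeat.interval default, and the class and method names are illustrative only:

          final class HeartbeatToleranceSketch {
            static final long HEARTBEAT_INTERVAL_MS = 3_000L; // default DN heartbeat interval (3 s)
            static final int TOLERATE_MULTIPLIER = 4;         // value chosen in the comment above

            // A heartbeat older than 4 intervals (12 s here) marks the node as the preferred
            // target when choosing an excess replica to delete.
            static boolean isHeartbeatStale(long nowMs, long lastHeartbeatMs) {
              return (nowMs - lastHeartbeatMs) > TOLERATE_MULTIPLIER * HEARTBEAT_INTERVAL_MS;
            }
          }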

          Todd Lipcon added a comment -

          > Todd, if you want to double "minus heartbeat interval", please get some motivation for that. I mean why not triple or 1.5 x.

          Sure, either of those seems fine too. I just meant to express that using the heartbeat interval itself may not be right, since the actual receipt of the heartbeats may jitter a bit around the configured value (e.g., if a DN has to make a block report, or otherwise gets held up by a couple of seconds).

          Konstantin Shvachko added a comment -

          I am working on a unit test. This is for a preview of the change.
          Nicholas, yes if the blocks can be manually copied from flaky nodes, then the data is not completely lost.
          Todd, if you want to double "minus heartbeat interval", please get some motivation for that. I mean why not triple or 1.5 x.

          Todd Lipcon added a comment -

          > first look at the oldest heartbeat time, and second at the free space, when all heartbeats are within the heartbeat interval

          It seems we need some kind of slop factor on the heartbeat interval, right? E.g., if a DN is within 2x the heartbeat interval then it's considered "probably alive" for this policy.

          Tsz Wo Nicholas Sze added a comment -

          Hi Konstantin,

          Your proposal sounds good.

          BTW, I bet you could manually copy the blocks from do1, do2 and do3. Is that the case? If yes, then it is actually data unavailability but not data loss.

          Konstantin Shvachko added a comment -

          I propose to adjust BlockPlacementPolicyDefault.chooseReplicaToDelete() to look first at the oldest heartbeat time, and second at the free space when all heartbeats are within the heartbeat interval.
          With such a policy, in the scenario above the replicas for deletion are most likely to be assigned to do1, do2, do3, but they will never be deleted, because the old nodes have already died. The NN will automatically remove replicas from the live ones 10 minutes later or so.
          Also, when only one or two DNs malfunction in a similar scenario, this will reduce unnecessary deletions and replications.
          No change in policy will be seen in the regular case when all nodes function properly.
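
          A simplified model of the proposed heuristic, using stand-in types rather than the real BlockPlacementPolicyDefault.chooseReplicaToDelete() signature; names and structure are illustrative only:

          import java.util.Comparator;
          import java.util.List;

          class ReplicaDeletionSketch {
            /** Minimal stand-in for the NN's view of a DataNode: last heartbeat time and free space. */
            static class Node {
              final String name;
              final long lastHeartbeatMs;
              final long remainingBytes;
              Node(String name, long lastHeartbeatMs, long remainingBytes) {
                this.name = name;
                this.lastHeartbeatMs = lastHeartbeatMs;
                this.remainingBytes = remainingBytes;
              }
            }

            // Prefer deleting from the node with the oldest heartbeat when that heartbeat falls
            // outside the tolerated window; otherwise fall back to the least-free-space rule.
            static Node chooseReplicaToDelete(List<Node> candidates, long nowMs, long toleratedMs) {
              Node oldest = candidates.stream()
                  .min(Comparator.comparingLong(n -> n.lastHeartbeatMs))
                  .orElseThrow(IllegalStateException::new);
              if (nowMs - oldest.lastHeartbeatMs > toleratedMs) {
                return oldest; // likely dead or flaky; its replica is the safest one to give up
              }
              return candidates.stream()
                  .min(Comparator.comparingLong(n -> n.remainingBytes))
                  .orElseThrow(IllegalStateException::new);
            }
          }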

          Konstantin Shvachko added a comment - edited
          • A block b has 3 replicas initially located on DNs do1, do2, do3.
          • At different times all three nodes malfunctioned and died, causing the replicas to migrate to dn1, dn2, dn3.
          • do1, do2, do3 were not added to the exclude list.
            And when the cluster restarts do1, do2, do3 are brought up along with dn1, dn2, dn3.
          • NN sees 6 replicas for block b and correctly decides to remove 3 of them.
            BlockPlacementPolicyDefault.chooseReplicaToDelete() selects three targets for deletion based on the free space remaining on the DNs deemed to possess replicas.
            dn1, dn2, dn3 are most likely to be the targets for replica deletion, because they have been on the cluster longer than do1, do2, do3 and therefore are likely to have less free space.
          • Expectedly do1, do2, do3 malfunction again and go down shortly after reporting their blocks to NN.
          • It will take 10 minutes for the NN to recognize that do1, do2, do3 are dead. By that time the replicas will have been removed from the good nodes, resulting in data loss.

          This is the real story seen in production.
          I verified that all major versions are affected.
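
          To make the failure mode concrete, here is a toy version of the free-space-based selection described above; the node names follow the scenario, and the byte figures are invented for illustration:

          import java.util.Comparator;
          import java.util.Map;

          class FreeSpaceOnlyDeletionSketch {
            // Pre-patch rule modeled here: always delete the replica on the node with the least free space.
            static String chooseReplicaToDelete(Map<String, Long> remainingBytesByNode) {
              return remainingBytesByNode.entrySet().stream()
                  .min(Map.Entry.comparingByValue())
                  .orElseThrow(IllegalStateException::new)
                  .getKey();
            }

            public static void main(String[] args) {
              // dn1-dn3 have been serving the cluster and have less free space than the freshly
              // restarted (flaky) do1-do3, so dn1-dn3 are repeatedly picked as deletion targets.
              Map<String, Long> remaining = Map.of(
                  "dn1", 10L << 30, "dn2", 12L << 30, "dn3", 11L << 30,
                  "do1", 500L << 30, "do2", 480L << 30, "do3", 510L << 30);
              System.out.println("First deletion target: " + chooseReplicaToDelete(remaining));
            }
          }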


            People

            • Assignee:
              Konstantin Shvachko
              Reporter:
              Konstantin Shvachko
            • Votes:
              1
              Watchers:
              16
