Hadoop HDFS / HDFS-11609

Some blocks can be permanently lost if nodes are decommissioned while dead

    Details

    • Hadoop Flags: Reviewed

      Description

      When all the nodes containing a replica of a block are decommissioned while they are dead, they get decommissioned right away even if there are missing blocks. This behavior was introduced by HDFS-7374.

      The problem starts when those decommissioned nodes are brought back online. The namenode no longer shows missing blocks, which creates a false sense of cluster health. When the decommissioned nodes are removed and reformatted, the block data is permanently lost. The namenode will report missing blocks after the heartbeat recheck interval (e.g. 10 minutes) from the moment the last node is taken down.

      There are multiple issues in the code. As some of them cause different behaviors in testing vs. production, it took a while to reproduce this in a unit test. I will present an analysis and a proposal soon.

      1. HDFS-11609_v2.branch-2.patch
        10 kB
        Kihwal Lee
      2. HDFS-11609_v2.trunk.patch
        10 kB
        Kihwal Lee
      3. HDFS-11609_v3.branch-2.7.patch
        10 kB
        Kihwal Lee
      4. HDFS-11609_v3.branch-2.patch
        10 kB
        Kihwal Lee
      5. HDFS-11609_v3.trunk.patch
        11 kB
        Kihwal Lee
      6. HDFS-11609.branch-2.patch
        9 kB
        Kihwal Lee
      7. HDFS-11609.trunk.patch
        10 kB
        Kihwal Lee

        Issue Links

          Activity

          kihwal Kihwal Lee added a comment -

          Inability to correctly guess the previous replication priority

          Guessing the previous replication priority level of a block works most of the time, but is not perfect. Different orders of events can lead to the identical current state, yet the previous priority levels can differ. We can improve the priority update method so that the guessing logic still provides a benefit in the majority of cases, while its correctness is not strictly necessary.

          The following shows the problems I encountered and their solutions.

          In UnderReplicatedBlocks,

            private int getPriority(int curReplicas,
                                    int readOnlyReplicas,
                                    int decommissionedReplicas,
                                    int expectedReplicas) {
              assert curReplicas >= 0 : "Negative replicas!";
          

          This is called from update(), which calls it with curReplicas set to curReplicas - curReplicaDelta. When all replica-containing nodes are dead (curReplicas is 0) and a decommissioned node holding a replica joins, getPriority() ends up being called with curReplicas of -1, which sets off the assert. This causes initial block report processing to stop in the middle. The node is live and decommissioned, and the block will appear missing because the block report wasn't processed due to the assertion failure.

          This can be avoided by not setting curReplicaDelta to 1 when the replica is on a decommissioned node. This value originates from BlockManager's addStoredBlock().

               if (result == AddBlockResult.ADDED) {
          -      curReplicaDelta = 1;
          +      curReplicaDelta = (node.isDecommissioned()) ? 0 : 1;
          

          This fixes this particular issue.
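          The effect of the fix can be illustrated with a small standalone sketch (the class and method names below are hypothetical simplifications, not the actual BlockManager code):

          ```java
          // Hypothetical, simplified sketch of the delta computation described above.
          public class ReplicaCounter {
              /** Delta applied to the live-replica count when a reported replica is added. */
              static int curReplicaDelta(boolean added, boolean nodeDecommissioned) {
                  if (!added) {
                      return 0;
                  }
                  // A replica on a decommissioned node must not count as a live replica;
                  // otherwise update() later computes curReplicas - delta = 0 - 1 = -1.
                  return nodeDecommissioned ? 0 : 1;
              }

              public static void main(String[] args) {
                  // All replica-containing nodes are dead (curReplicas == 0), and a
                  // decommissioned node holding a replica rejoins.
                  int curReplicas = 0;
                  int delta = curReplicaDelta(true, true);
                  // With the fix, the count passed down to getPriority() stays >= 0.
                  System.out.println(curReplicas - delta); // prints 0, not -1
              }
          }
          ```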

          The assert is disabled in production builds, so the behavior differs at runtime. Instead of block report processing blowing up, without the above fix the -1 causes getPriority() to return the level QUEUE_VERY_UNDER_REPLICATED, which is incorrect.

          If the previous priority level is guessed incorrectly and happens to be identical to the current level, the old entry won't be removed, resulting in duplicate entries. The remove() method is already robust: if a block is not found at the specified level, it tries to remove it from the other priority levels too. So we can simply call remove() unconditionally. Guessing the old priority is no longer functionally necessary with this change, but it is still useful, since the guess is normally correct and removal then visits only one priority level in most cases.

          -    if(oldPri != curPri) {
          -      remove(block, oldPri);
          -    }
          +    // oldPri is mostly correct, but not always. If not found with oldPri,
          +    // other levels will be searched until the block is found & removed.
          +    remove(block, oldPri);
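          The fallback behavior this change relies on can be sketched as follows (an illustrative stand-in, not the actual UnderReplicatedBlocks class):

          ```java
          import java.util.ArrayList;
          import java.util.LinkedHashSet;
          import java.util.List;

          // Illustrative sketch of per-level block queues with a robust remove():
          // try the guessed level first, then scan the remaining levels.
          public class PriorityQueues {
              static final int LEVELS = 5;
              private final List<LinkedHashSet<Long>> queues = new ArrayList<>();

              PriorityQueues() {
                  for (int i = 0; i < LEVELS; i++) {
                      queues.add(new LinkedHashSet<>());
                  }
              }

              void add(long blockId, int level) {
                  queues.get(level).add(blockId);
              }

              /** Remove from the guessed level; if absent, search all other levels. */
              boolean remove(long blockId, int guessedLevel) {
                  if (queues.get(guessedLevel).remove(blockId)) {
                      return true; // guess was right: only one level visited
                  }
                  for (int i = 0; i < LEVELS; i++) {
                      if (i != guessedLevel && queues.get(i).remove(blockId)) {
                          return true; // wrong guess, but no duplicate entry remains
                      }
                  }
                  return false;
              }
          }
          ```

          Because a wrong guess only costs extra levels scanned, calling remove() unconditionally can never leave a stale duplicate behind.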
          

          Replication priority level of a block with only decommissioned replicas

          With the surrounding bugs fixed, now we can address the real issue. getPriority() explicitly does this:

              } else if (curReplicas == 0) {
                // If there are zero non-decommissioned replicas but there are
                // some decommissioned replicas, then assign them highest priority
                if (decommissionedReplicas > 0) {
                  return QUEUE_HIGHEST_PRIORITY;
                }
          

          This does not make any sense. Since decommissioned nodes are never chosen as a replication source, the block cannot be re-replicated. At this priority, the block won't be recognized as "missing" either. The cluster will appear healthy until the decommissioned nodes are taken down, at which point it might be too late to recover the data.

          There are several possible approaches to this.
          1) If all it has is decommissioned replicas, show it as missing. I.e. priority level of QUEUE_WITH_CORRUPT_BLOCKS. fsck will show the decommissioned locations and the admin can recommission/decommission or manually copy the data out.
          2) Re-evaluate all replicas when a decommissioned node rejoins. The simplest way is to start decommissioning the node again.
          3) Allow a decommissioned replica to be picked as a replication source in this special case. 1) might still be needed.
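          Option 3) can be sketched as a last-resort source selection (the enum and method names here are illustrative, not the actual BlockManager code):

          ```java
          import java.util.ArrayList;
          import java.util.List;

          // Hypothetical sketch of option 3): a decommissioned replica becomes an
          // eligible replication source only when no live replica exists.
          public class SourceChooser {
              enum State { LIVE, DECOMMISSIONED, DEAD }

              /** Pick replication sources; decommissioned nodes only as a last resort. */
              static List<Integer> chooseSources(State[] replicas) {
                  List<Integer> live = new ArrayList<>();
                  List<Integer> decommissioned = new ArrayList<>();
                  for (int i = 0; i < replicas.length; i++) {
                      if (replicas[i] == State.LIVE) {
                          live.add(i);
                      } else if (replicas[i] == State.DECOMMISSIONED) {
                          decommissioned.add(i);
                      }
                  }
                  if (!live.isEmpty()) {
                      return live; // normal case: replicate from live nodes
                  }
                  return decommissioned; // last resort: only decommissioned replicas remain
              }
          }
          ```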

          I have a patch with 1) and a unit test, but want to hear from others before posting.

          kihwal Kihwal Lee added a comment -

          I have a patch with 1) and a unit test, but want to hear from others before posting.

          I thought it was acceptable to simply let admins know something is wrong (instead of pretending everything is fine), but recovery would then be hard for many people, especially if it involves many nodes and blocks. So it might be better to automatically recover as much as possible. The version of the patch I am attaching now considers a decommissioned node as a replication source if there is no live node, i.e. it implements option 3). 1) is not done here, as it would prevent replication from happening.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 54s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 13m 35s trunk passed
          +1 compile 0m 45s trunk passed
          +1 checkstyle 0m 38s trunk passed
          +1 mvnsite 0m 50s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 45s trunk passed
          +1 javadoc 0m 40s trunk passed
          +1 mvninstall 0m 45s the patch passed
          +1 compile 0m 43s the patch passed
          +1 javac 0m 43s the patch passed
          -0 checkstyle 0m 34s hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 149 unchanged - 0 fixed = 150 total (was 149)
          +1 mvnsite 0m 48s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 49s the patch passed
          +1 javadoc 0m 37s the patch passed
          -1 unit 64m 40s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          91m 1s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
            hadoop.hdfs.TestCrcCorruption



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HDFS-11609
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12861770/HDFS-11609.trunk.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 5de4da3da0ea 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 5faa949
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/18960/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/18960/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/18960/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18960/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          kihwal Kihwal Lee added a comment -

          Attaching the branch-2 version of the patch.

          jojochuang Wei-Chiu Chuang added a comment -

          Hello Kihwal Lee, thanks for posting the patch and the very nice analysis. I am still reviewing the patch. Could you update the comments in the code as well?
          For example,

                 // never use already decommissioned nodes, maintenance node not
                 // suitable for read or unknown state replicas.
          -      if (state == null || state == StoredReplicaState.DECOMMISSIONED
          -          || state == StoredReplicaState.MAINTENANCE_NOT_FOR_READ) {
          +      if (state == null ||
          +          state == StoredReplicaState.MAINTENANCE_NOT_FOR_READ) {
          

          I think "already decommissioned nodes," should be removed from the comment.

          It would also be nice if you could add comments next to the code added in BlockManager explaining what it is meant to do.

          Thanks!

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 14m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 9m 18s branch-2 passed
          +1 compile 0m 39s branch-2 passed with JDK v1.8.0_121
          +1 compile 0m 43s branch-2 passed with JDK v1.7.0_121
          +1 checkstyle 0m 30s branch-2 passed
          +1 mvnsite 0m 52s branch-2 passed
          +1 mvneclipse 0m 17s branch-2 passed
          +1 findbugs 1m 58s branch-2 passed
          +1 javadoc 0m 56s branch-2 passed with JDK v1.8.0_121
          +1 javadoc 1m 35s branch-2 passed with JDK v1.7.0_121
          +1 mvninstall 0m 43s the patch passed
          +1 compile 0m 36s the patch passed with JDK v1.8.0_121
          +1 javac 0m 36s the patch passed
          +1 compile 0m 40s the patch passed with JDK v1.7.0_121
          +1 javac 0m 40s the patch passed
          -0 checkstyle 0m 26s hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 184 unchanged - 1 fixed = 185 total (was 185)
          +1 mvnsite 0m 48s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 6s the patch passed
          +1 javadoc 0m 50s the patch passed with JDK v1.8.0_121
          +1 javadoc 1m 33s the patch passed with JDK v1.7.0_121
          -1 unit 48m 26s hadoop-hdfs in the patch failed with JDK v1.7.0_121.
          -1 asflicense 0m 20s The patch generated 1 ASF License warnings.
          145m 13s



          Reason Tests
          JDK v1.8.0_121 Failed junit tests hadoop.hdfs.web.TestWebHDFS
          JDK v1.8.0_121 Timed out junit tests org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting
          JDK v1.7.0_121 Failed junit tests hadoop.hdfs.server.datanode.metrics.TestDataNodeOutlierDetectionViaMetrics
            hadoop.metrics2.sink.TestRollingFileSystemSinkWithSecureHdfs
            hadoop.hdfs.server.namenode.TestFSImageWithXAttr
            hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:b59b8b7
          JIRA Issue HDFS-11609
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12861899/HDFS-11609.branch-2.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux ba5ae41e22cc 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2 / 8c21b2a
          Default Java 1.7.0_121
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_121 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/18969/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/18969/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_121.txt
          JDK v1.7.0_121 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/18969/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/18969/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18969/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          djp Junping Du added a comment -

          Sounds like a blocker for 2.8.1. Kihwal Lee and Wei-Chiu Chuang, what do you guys think?

          kihwal Kihwal Lee added a comment -

          Agree it is a blocker. I will update the patch today.

          kihwal Kihwal Lee added a comment -

          Attaching the updated patches.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 13m 31s trunk passed
          +1 compile 0m 58s trunk passed
          +1 checkstyle 0m 41s trunk passed
          +1 mvnsite 1m 11s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 2m 6s trunk passed
          +1 javadoc 0m 42s trunk passed
          +1 mvninstall 1m 0s the patch passed
          +1 compile 0m 45s the patch passed
          +1 javac 0m 45s the patch passed
          -0 checkstyle 0m 34s hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 150 unchanged - 0 fixed = 151 total (was 150)
          +1 mvnsite 0m 49s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 49s the patch passed
          +1 javadoc 0m 38s the patch passed
          -1 unit 64m 15s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 19s The patch does not generate ASF License warnings.
          91m 16s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
            hadoop.hdfs.server.balancer.TestBalancer
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ac17dc
          JIRA Issue HDFS-11609
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12863841/HDFS-11609_v2.trunk.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 6d4e836a1d74 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 654372d
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19127/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19127/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19127/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19127/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          kihwal Kihwal Lee added a comment -

          Unit test failures are not related to the patch. I ran all of them successfully.

          kihwal Kihwal Lee added a comment -

          Wei-Chiu Chuang, would you look at the updated patch?

          daryn Daryn Sharp added a comment -

          +1, pending an update to the comment "We do not use already decommissioned nodes as a source" to mention that they are used as a last resort.

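          The "last resort" policy discussed above can be sketched as follows. This is a minimal, hypothetical illustration of the idea only — the class, names, and types are invented and greatly simplified from the actual BlockManager source-selection code: prefer live, in-service replicas as replication sources, and fall back to already decommissioned nodes only when no in-service replica exists, rather than leaving the block unreplicable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the source-selection policy discussed in this
// thread. Names are illustrative and do not match the real HDFS code.
public class SourceSelection {
    enum State { IN_SERVICE, DECOMMISSIONED, DEAD }

    static final class Node {
        final String name;
        final State state;
        Node(String name, State state) { this.name = name; this.state = state; }
    }

    /**
     * Pick replication sources for a block. In-service replicas are
     * preferred; already decommissioned nodes are used only as a last
     * resort, when no in-service replica is available. Dead nodes are
     * never usable as sources.
     */
    static List<Node> chooseSources(List<Node> replicas) {
        List<Node> live = new ArrayList<>();
        List<Node> decommissioned = new ArrayList<>();
        for (Node n : replicas) {
            if (n.state == State.IN_SERVICE) {
                live.add(n);
            } else if (n.state == State.DECOMMISSIONED) {
                decommissioned.add(n);
            }
            // DEAD replicas are skipped entirely.
        }
        // Last resort: decommissioned nodes only when nothing live exists.
        return live.isEmpty() ? decommissioned : live;
    }

    public static void main(String[] args) {
        List<Node> replicas = new ArrayList<>();
        replicas.add(new Node("dn1", State.DECOMMISSIONED));
        replicas.add(new Node("dn2", State.DEAD));
        // With no live replica, the decommissioned node is still offered
        // as a source instead of leaving the block with no source at all.
        System.out.println(chooseSources(replicas).get(0).name);
    }
}
```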
          kihwal Kihwal Lee added a comment -

          Attaching patches with the comment corrected. The code changes are identical to v2.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 18s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 15m 6s trunk passed
          +1 compile 0m 52s trunk passed
          +1 checkstyle 0m 40s trunk passed
          +1 mvnsite 1m 2s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          -1 findbugs 1m 47s hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings.
          +1 javadoc 0m 46s trunk passed
          +1 mvninstall 1m 0s the patch passed
          +1 compile 0m 50s the patch passed
          +1 javac 0m 50s the patch passed
          -0 checkstyle 0m 39s hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 149 unchanged - 0 fixed = 150 total (was 149)
          +1 mvnsite 1m 0s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 53s the patch passed
          +1 javadoc 0m 41s the patch passed
          -1 unit 78m 57s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 19s The patch does not generate ASF License warnings.
          107m 48s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestEncryptionZones
            hadoop.hdfs.server.balancer.TestBalancer
            hadoop.hdfs.server.namenode.TestMetadataVersionOutput
            hadoop.hdfs.TestMaintenanceState
            hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting
            hadoop.hdfs.server.namenode.TestStartup



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ac17dc
          JIRA Issue HDFS-11609
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12865583/HDFS-11609_v3.trunk.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 003c4dcfd7fd 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 64f68cb
          Default Java 1.8.0_121
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19248/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19248/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19248/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19248/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19248/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          kihwal Kihwal Lee added a comment -

          Only the comment is updated in the new patch. The test failures are not related; I reran all the failed test cases. TestDataNodeVolumeFailureReporting times out occasionally, probably HDFS-11488. The rest are being fixed in HADOOP-14368.

          daryn Daryn Sharp added a comment -

          +1. The test seems to pass for me too, so maybe it was a fluke.

          kihwal Kihwal Lee added a comment -

          Committed to trunk through branch-2.7.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11661 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11661/)
          HDFS-11609. Some blocks can be permanently lost if nodes are (kihwal: rev 07b98e7830c2214340cb7f434df674057e89df94)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/LowRedundancyBlocks.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestDecommissioningStatus.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          vinodkv Vinod Kumar Vavilapalli added a comment -

          2.8.1 became a security release. Moving fix-version to 2.8.2 after the fact.


            People

            • Assignee: Kihwal Lee
            • Reporter: Kihwal Lee
            • Votes: 0
            • Watchers: 18