Hadoop HDFS / HDFS-11499

Decommissioning stuck because of failing recovery

    Details

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Allow a block to complete if the number of replicas on live nodes, decommissioning nodes and nodes in maintenance mode satisfies the minimum replication factor. The fix prevents block recovery from failing when a replica of the last block is being decommissioned and, conversely, prevents decommissioning from getting stuck waiting for the last block to be completed. In addition, the file close() operation will not fail because the last block is being decommissioned.

      Description

      Block recovery will fail to finalize the file if the locations of the last, incomplete block are being decommissioned. Conversely, the decommissioning will be stuck, waiting for the last block to be completed.

      org.apache.hadoop.ipc.RemoteException(java.lang.IllegalStateException): Failed to finalize INodeFile testRecoveryFile since blocks[255] is non-complete, where blocks=[blk_1073741825_1001, blk_1073741826_1002...
      

      The fix is to count replicas on decommissioning nodes when completing the last block in BlockManager.commitOrCompleteLastBlock, since the DecommissionManager will not decommission a node that has under-construction (UC) blocks.
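
      A minimal, illustrative sketch of the idea behind the fix (not the actual BlockManager code; the class, method and parameter names below are made up for illustration): replicas on decommissioning and entering-maintenance nodes are still readable, and the admin manager will not finish decommissioning a node that holds a replica of an under-construction block, so they can safely count toward the minimum needed to complete the block.

        // Illustrative sketch only, not the real BlockManager logic.
        public class UsableReplicaCheckSketch {
          static boolean hasEnoughUsableReplicas(int live, int decommissioning,
              int enteringMaintenance, int minReplication) {
            // Replicas on decommissioning / entering-maintenance nodes still
            // serve reads, so they count toward the completion threshold.
            int usable = live + decommissioning + enteringMaintenance;
            return usable >= minReplication;
          }
        }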

      1. HDFS-11499.02.patch
        8 kB
        Manoj Govindassamy
      2. HDFS-11499.03.patch
        8 kB
        Manoj Govindassamy
      3. HDFS-11499.04.patch
        8 kB
        Lukas Majercak
      4. HDFS-11499.05.patch
        7 kB
        Lukas Majercak
      5. HDFS-11499.branch-2.7.patch
        3 kB
        Wei-Chiu Chuang
      6. HDFS-11499.branch-2.8.patch
        3 kB
        Wei-Chiu Chuang
      7. HDFS-11499.patch
        4 kB
        Lukas Majercak

        Issue Links

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user lukmajercak opened a pull request:

          https://github.com/apache/hadoop/pull/199

          HDFS-11499 Decommissioning stuck because of failing recovery

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/lukmajercak/hadoop HDFS-11499

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/hadoop/pull/199.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #199


          commit 3609b1353e64a24dee4746b8fa23ed7547768d68
          Author: Lukas Majercak <lumajerc@microsoft.com>
          Date: 2017-03-05T20:04:06Z

          HDFS-11499 add TestDecommission.testDecommissionWithOpenFileAndDatanodeFailing for testing recovery

          commit 3f97d89f75d8a20f878da8c438141f9b6adf7da0
          Author: Lukas Majercak <lumajerc@microsoft.com>
          Date: 2017-03-05T20:05:08Z

          HDFS-11499 count decommissioning replicas when completing last block in BlockManager.commitOrCompleteLastBlock


          jojochuang Wei-Chiu Chuang added a comment -

          Hi Lukas,
          this issue sounds related to HDFS-11486. The block recovery failure here and the client close failure in HDFS-11486 are both caused by the same buggy check.

          Can you update the status of this JIRA to Patch Available to trigger the precommit check? Thanks!

          elgoiri Íñigo Goiri added a comment -

          The fix seems correct to me and the unit test seems good enough.
          I think the other uses of hasMinStorage() are correct.
          It seems this was introduced in HDFS-1172 a couple years ago; Todd Lipcon, Masatake Iwasaki, do you mind taking a look?

          jojochuang Wei-Chiu Chuang added a comment -

          Hi Lukas Majercak,
          thanks for the patch. I quickly reviewed it and I think it does the right thing. The test looks good too. In addition, it also fixes the close() failure bug I described in HDFS-11486.

          • HDFS now implements a new replica state called maintenance mode. It seems that case is not being considered in the patch. Would it make sense to also fix the same issue with maintenance mode? Manoj Govindassamy, how do you feel?
          manojg Manoj Govindassamy added a comment -

          Sure Wei-Chiu Chuang. I can give you a patch for the same, soon.

          lukmajercak Lukas Majercak added a comment -

          Hi Wei-Chiu Chuang, Manoj Govindassamy

          So shall we count in the replicas on maintenance nodes as well? Do we need to add a test case to cover this, or modify the one in the patch?

          Thanks

          manojg Manoj Govindassamy added a comment -

          Yes Lukas Majercak, we additionally need to count the ENTERING_MAINTENANCE nodes as well. I am adding a new test based on the one given by Yiqun Lin in HDFS-11486 in TestMaintenanceState to cover this case.

          Lukas Majercak/Wei-Chiu Chuang, shall I merge all the fix and test patches along with mine and post a complete patch covering both HDFS-11499 and HDFS-11486? Or shall I submit a v02 patch for HDFS-11499 alone with the maintenance state included? Your suggestion, please?

          lukmajercak Lukas Majercak added a comment -

          Manoj Govindassamy, I don't really mind, let's go for the second option? Also, + Ming Ma, do you want to check this? Seems like it is related to your HDFS-9390.

          manojg Manoj Govindassamy added a comment -

          Lukas Majercak, Wei-Chiu Chuang, Yiqun Lin,
          Attached v02 patch to address the following. Can you please take a look at the patch?

          • BlockManager#commitOrCompleteLastBlock() to consider entering_maintenance replicas along with decommissioning replicas as usable replicas.
          • Added unit test testFileCloseAfterEnteringMaintenance to TestMaintenanceState based on the test given by Yiqun. Without the fix, the test fails at file close.
          lukmajercak Lukas Majercak added a comment -

          LGTM

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 33s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 19m 58s trunk passed
          +1 compile 1m 18s trunk passed
          +1 checkstyle 0m 58s trunk passed
          +1 mvnsite 1m 28s trunk passed
          +1 mvneclipse 0m 23s trunk passed
          +1 findbugs 2m 25s trunk passed
          +1 javadoc 1m 0s trunk passed
          +1 mvninstall 1m 11s the patch passed
          +1 compile 1m 2s the patch passed
          +1 javac 1m 2s the patch passed
          -0 checkstyle 0m 43s hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 146 unchanged - 0 fixed = 150 total (was 146)
          +1 mvnsite 1m 10s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 16s the patch passed
          +1 javadoc 0m 50s the patch passed
          -1 unit 96m 49s hadoop-hdfs in the patch failed.
          +1 asflicense 1m 58s The patch does not generate ASF License warnings.
          136m 19s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting
            hadoop.hdfs.TestFileAppend3
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
            hadoop.hdfs.TestDecommission
            hadoop.hdfs.server.datanode.TestDirectoryScanner
            hadoop.hdfs.server.datanode.TestDataNodeUUID
          Timed out junit tests org.apache.hadoop.hdfs.server.namenode.TestEditLog
            org.apache.hadoop.hdfs.server.namenode.TestQuotaByStorageType
            org.apache.hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean
            org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure030
            org.apache.hadoop.hdfs.server.namenode.TestEditLogAutoroll



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HDFS-11499
          GITHUB PR https://github.com/apache/hadoop/pull/199
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 278985d535e8 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / d9dc444
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/18598/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/18598/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/18598/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18598/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          lukmajercak Lukas Majercak added a comment -

          Looks like the test timed out. Manoj Govindassamy, do you mind increasing the timeout to 360 sec (based on the other tests in TestDecommission) and submitting a patch? Thanks!
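
          For illustration only (not the patch itself, and assuming JUnit 4's @Test timeout attribute as used by these HDFS tests), raising the test timeout to 360 seconds would look roughly like the sketch below; the test body is elided.

            import org.junit.Test;

            public class TimeoutSketch {
              // 360000 ms (360 s) timeout, in line with other TestDecommission tests.
              @Test(timeout = 360000)
              public void testDecommissionWithOpenFileAndDatanodeFailing()
                  throws Exception {
                // test body elided; only the timeout attribute is the point here
              }
            }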

          manojg Manoj Govindassamy added a comment -

          Lukas Majercak,
          Are you referring to the timeout in TestDecommission#testDecommissionWithOpenFileAndDatanodeFailing(), which was part of patch v01? In patch v02 I added a maintenance-state-related test. I'm not sure extending the timeout for the failed test is going to solve the problem, because the nodes didn't move to the DECOMMISSIONED state as the test expects.

          2017-03-06 23:33:49,462 [Thread-782] INFO  hdfs.AdminStatesBaseTest (AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node 127.0.0.1:33069 to change state to Decommissioned current state: Decommission In Progress
          2017-03-06 23:33:49,462 [Thread-782] INFO  hdfs.AdminStatesBaseTest (AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node 127.0.0.1:33069 to change state to Decommissioned current state: Decommission In Progress
          
          [test timeout]
          
          2017-03-06 23:33:49,486 [main] INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1951)) - Shutting down the Mini HDFS Cluster
          
          lukmajercak Lukas Majercak added a comment -

          Manoj Govindassamy, yes, TestDecommission#testDecommissionWithOpenFileAndDatanodeFailing(). It will wait for all three DNs to be decommissioned, right? The log you showed is just one of them. I would suggest increasing it to 360 sec, yes. It finishes in ~30 seconds on my machine, similar to TestDecommission#testDeadNodeCountAfterNamenodeRestart, which has a 360 sec timeout.

          manojg Manoj Govindassamy added a comment -

          Lukas Majercak,
          Attached v03 patch to have the same test timeout as the other tests and also fixed checkstyle issues. Please take a look.

          linyiqun Yiqun Lin added a comment -

          Great work, everyone!
          Only one nit for the v03 patch:

          +
          +    Path openFile = new Path("/testClosingFileInMaintenance.dat");
          +    // Lets write 2 blocks of data to the openFile
          +    writeFile(getCluster().getFileSystem(), openFile, (short) 3);
          +
          

          The comment seems inaccurate; here we write three block replicas to openFile, right, Manoj Govindassamy? Or should this be replaced by "two more blocks"?

          manojg Manoj Govindassamy added a comment -

          Yiqun Lin,

          Thanks for the review. I was using the writeFile method version below, where the last param is the file replication factor and not the block count. The method in turn creates a total of 2 blocks for the file with the provided replication factor. Is the comment still wrong? Please let me know.

            static protected void writeFile(FileSystem fileSys, Path name, int repl)
                throws IOException {
              writeFile(fileSys, name, repl, 2);
            }
          
          linyiqun Yiqun Lin added a comment -

          You are right, I misread this. +1 for the patch, pending Jenkins.

          iwasakims Masatake Iwasaki added a comment -

          Thanks for working on this, Lukas Majercak and Manoj Govindassamy.

          While testing the 03 patch, the added testDecommissionWithOpenFileAndDatanodeFailing intermittently times out waiting for decommission. I'm looking into the cause.

          lukmajercak Lukas Majercak added a comment - edited

          Masatake Iwasaki, indeed it sometimes times out, also looking into the cause.

          lukmajercak Lukas Majercak added a comment - edited

          Looks like the issue is in the configuration. I have been running this test on 2.7.1 with no problems, and just found that trunk is missing some configuration, specifically:

          conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY, 4).
          

          The test times out because of a block being in PendingReconstructionBlocks.

          elgoiri Íñigo Goiri added a comment -

          It looks like the DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY setting was removed in HDFS-9392.
          I don't see a good reason to remove it; I think we should bring it back.
          Lukas Majercak, do you mind adding the conf back?

          elgoiri Íñigo Goiri added a comment -

          Unit test broken.

          lukmajercak Lukas Majercak added a comment -

          Submitted a patch with the configuration added back to AdminStatesBaseTest.

          iwasakims Masatake Iwasaki added a comment -

          The timeout seems to be relevant, since replica recovery was not attempted after the first 30 seconds in the failed test case.

          $ grep 'initReplicaRecovery:' org.apache.hadoop.hdfs.TestDecommission-output.txt.failed
          2017-03-07 14:13:35,095 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@7dcda518] INFO  impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2382)) - initReplicaRecovery: blk_1073741826_1002, recoveryId=1004, replica=FinalizedReplica, blk_1073741826_1002, FINALIZED
          2017-03-07 14:13:35,096 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@7dcda518] INFO  impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2440)) - initReplicaRecovery: changing replica state for blk_1073741826_1002 from FINALIZED to RUR
          ...snip
          2017-03-07 14:14:03,092 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5c8628b1] INFO  impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2382)) - initReplicaRecovery: blk_1073741826_1002, recoveryId=1018, replica=ReplicaUnderRecovery, blk_1073741826_1002, RUR
          2017-03-07 14:14:03,092 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5c8628b1] INFO  impl.FsDatasetImpl (FsDatasetImpl.java:initReplicaRecoveryImpl(2433)) - initReplicaRecovery: update recovery id for blk_1073741826_1002 from 1017 to 1018
          
          $ tail -n2 org.apache.hadoop.hdfs.TestDecommission-output.txt.failed
          2017-03-07 14:19:26,875 [main] INFO  impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(607)) - DataNode metrics system shutdown complete.
          2017-03-07 14:19:26,987 [Thread-11] INFO  hdfs.AdminStatesBaseTest (AdminStatesBaseTest.java:waitNodeState(342)) - Waiting for node 127.0.0.1:43314 to change state to Decommissioned current state: Decommission In Progress
          

          DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY was replaced by DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY while keeping the effective default value, based on the description of HDFS-10219.

            public static final String  DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY =
                "dfs.namenode.reconstruction.pending.timeout-sec";
            public static final int
                DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_DEFAULT = 300;
          

          Trying to set the timeout to 4 in AdminStatesBaseTest#setup to see the effect.
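
          A sketch of the kind of change being tried (assumed form; the actual AdminStatesBaseTest setup code may differ): lower the pending-reconstruction timeout so the NameNode re-schedules block reconstruction/recovery quickly instead of waiting out the 300-second default.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.hdfs.DFSConfigKeys;
            import org.apache.hadoop.hdfs.HdfsConfiguration;

            public class PendingTimeoutSetupSketch {
              // Build a test Configuration with a short pending-reconstruction timeout.
              static Configuration newTestConf() {
                Configuration conf = new HdfsConfiguration();
                conf.setInt(
                    DFSConfigKeys.DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY, 4);
                return conf;
              }
            }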

          iwasakims Masatake Iwasaki added a comment -

          Setting DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY in TestDecommission was removed by HDFS-9392. Ming Ma, was this an intentional change?

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 31s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 14m 18s trunk passed
          +1 compile 0m 49s trunk passed
          +1 checkstyle 0m 36s trunk passed
          +1 mvnsite 0m 54s trunk passed
          +1 mvneclipse 0m 14s trunk passed
          +1 findbugs 1m 57s trunk passed
          +1 javadoc 0m 45s trunk passed
          +1 mvninstall 0m 55s the patch passed
          +1 compile 0m 56s the patch passed
          +1 javac 0m 56s the patch passed
          -0 checkstyle 0m 35s hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 146 unchanged - 0 fixed = 150 total (was 146)
          +1 mvnsite 0m 56s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 57s the patch passed
          +1 javadoc 0m 38s the patch passed
          -1 unit 80m 35s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 33s The patch does not generate ASF License warnings.
          108m 48s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.blockmanagement.TestReconstructStripedBlocksWithRackAwareness
            hadoop.hdfs.TestDecommission
            hadoop.hdfs.server.datanode.checker.TestThrottledAsyncChecker
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
          Timed out junit tests org.apache.hadoop.hdfs.TestLeaseRecovery2
            org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete
            org.apache.hadoop.hdfs.server.namenode.TestNamenodeCapacityReport
            org.apache.hadoop.hdfs.server.namenode.TestListCorruptFileBlocks
            org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionFunctional



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HDFS-11499
          GITHUB PR https://github.com/apache/hadoop/pull/199
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 81930ca04995 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 28daaf0
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/18633/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/18633/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/18633/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18633/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          githubbot ASF GitHub Bot added a comment -

          Github user lukmajercak closed the pull request at:

          https://github.com/apache/hadoop/pull/199

          lukmajercak Lukas Majercak added a comment -

          The build results were from the outdated GitHub PR; I closed it and resubmitted the patch.

          iwasakims Masatake Iwasaki added a comment -

          I think the other uses of hasMinStorage() are correct. It seems this was introduced in HDFS-1172 a couple years ago

          The relevant code had existed before HDFS-1172, but anyway we should fix this.

          iwasakims Masatake Iwasaki added a comment -

          A comment about testDecommissionWithOpenFileAndDatanodeFailing:

              // Kill one of the datanodes of the last block
              getCluster().stopDataNode(lastBlockLocations[0].getName());
          

          I think this is misleading and makes the test time unnecessarily long. If my understanding is correct, the issue is reproduced only if nodes are in decommissioning state while trying to complete the last block.

          How about making the nodes decommissioning first and then invoking lease recovery? Like:

              // Decommission all nodes of the last block
              ArrayList<String> toDecom = new ArrayList<>();
              for (DatanodeInfo dnDecom : lastBlockLocations) {
                toDecom.add(dnDecom.getXferAddr());
              }
              initExcludeHosts(toDecom);
              refreshNodes(0);
          
              // Make sure hard lease expires
              getCluster().setLeasePeriod(300L, 300L);
              Thread.sleep(2 * BLOCKREPORT_INTERVAL_MSEC);
          
              for (DatanodeInfo dnDecom : lastBlockLocations) {
                DatanodeInfo datanode = NameNodeAdapter.getDatanode(
                    getCluster().getNamesystem(), dnDecom);
                waitNodeState(datanode, AdminStates.DECOMMISSIONED);
              }
          

          Stopping the datanode causes a connection failure to the dead node and a retry of replica recovery, and merely makes it highly probable that the nodes are in decommissioning state before the last block is completed.

          iwasakims Masatake Iwasaki added a comment -

          Also, we don't need to set DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY if the test does not depend on the retrying behavior.

          lukmajercak Lukas Majercak added a comment -

          That looks good, Masatake Iwasaki; the test you suggested fails without the change to BlockManager.commitOrCompleteLastBlock and finishes more quickly with the change in place.
          Please see the new patch I've attached, with the DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY setting removed as well.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 23s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 19m 0s trunk passed
          +1 compile 0m 45s trunk passed
          +1 checkstyle 0m 36s trunk passed
          +1 mvnsite 0m 52s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 42s trunk passed
          +1 javadoc 0m 41s trunk passed
          +1 mvninstall 0m 45s the patch passed
          +1 compile 0m 44s the patch passed
          +1 javac 0m 44s the patch passed
          -0 checkstyle 0m 33s hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 146 unchanged - 0 fixed = 150 total (was 146)
          +1 mvnsite 0m 47s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 48s the patch passed
          +1 javadoc 0m 37s the patch passed
          -1 unit 85m 17s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 19s The patch does not generate ASF License warnings.
          116m 28s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HDFS-11499
          GITHUB PR https://github.com/apache/hadoop/pull/199
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux e354c5a9ece8 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 5addacb
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/18643/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/18643/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/18643/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18643/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          lukmajercak Lukas Majercak added a comment -

          The failing tests seem unrelated to the patch.

          elgoiri Íñigo Goiri added a comment -

          The unit tests in version 05 of the patch seem cleaner. LGTM.

          iwasakims Masatake Iwasaki added a comment -

          +1 on 05, will commit it shortly.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11379 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11379/)
          HDFS-11499. Decommissioning stuck because of failing recovery. (iwasakims: rev 385d2cb777a0272ac20c62336c944fad295d5d12)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestMaintenanceState.java
          iwasakims Masatake Iwasaki added a comment -

          Committed to branch-2 and trunk. Thanks to all for the contribution.

          manojg Manoj Govindassamy added a comment -

          Thanks for digging through the test failures and for the fix, Lukas Majercak, and for the review and commit help, Masatake Iwasaki.

          jojochuang Wei-Chiu Chuang added a comment -

          I think we should backport the fix to 2.7 and 2.8. It fixes a bug that makes file close() fail. Administrators may think the file suffers from corruption, because file recovery will also fail.

          Here's a branch-2.8 patch.

          jojochuang Wei-Chiu Chuang added a comment -

          Reopening the issue to submit the branch-2.8 patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 7s HDFS-11499 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HDFS-11499
          GITHUB PR https://github.com/apache/hadoop/pull/199
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18675/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jojochuang Wei-Chiu Chuang added a comment -

          Branch-2.7 patch

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 8s HDFS-11499 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HDFS-11499
          GITHUB PR https://github.com/apache/hadoop/pull/199
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18676/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jojochuang Wei-Chiu Chuang added a comment -

          The precommit build failed because it attempted to run the patch from the GitHub PR.
          I'm not sure how to force it to take the patch from here, but the branch-2.8 and branch-2.7 patches are straightforward with minor conflicts.

          andrew.wang Andrew Wang added a comment -

          I'm +1 if it's minor conflicts. Precommit won't run against a patch once there's a github PR.

          jojochuang Wei-Chiu Chuang added a comment -

          Thanks for the review, Andrew Wang. Pushed the commit into branch-2.7 and branch-2.8.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          2.8.1 became a security release. Moving fix-version to 2.8.2 after the fact.


            People

            • Assignee:
              lukmajercak Lukas Majercak
              Reporter:
              lukmajercak Lukas Majercak
            • Votes:
              0
              Watchers:
              16

              Dates

              • Created:
                Updated:
                Resolved:

                Development