
A down DataNode makes Balancer hang, repeatedly asking NameNode for its partial block list

    Details

    • Hadoop Flags:
      Reviewed

      Description

      We had a balancer that had not made any progress for a long time. It turned out it was repeatedly asking the NameNode for a partial block list of one datanode, which went down while the balancer was running.

      NameNode should notify Balancer that the datanode is not available and Balancer should stop asking for the datanode's block list.

      1. HDFS-742.patch
        1 kB
        Mit Desai
      2. HDFS-742.v1.trunk.patch
        2 kB
        Kihwal Lee
      3. HDFS-742-trunk.patch
        1 kB
        Mit Desai

        Activity

        mitdesai Mit Desai added a comment -

        Hey Hairong Kuang, are you still working on this JIRA? If not, I can take it over and work on it.

        mitdesai Mit Desai added a comment -

        Taking this over. Feel free to reassign if you are still working on it.

        mitdesai Mit Desai added a comment -

        Attaching the patch. Unfortunately, I do not have a way to reproduce the issue, so I am unable to provide a test that verifies the change.
        Here is an explanation of the part of the Balancer code that makes it hang forever.

        In the following while loop in Balancer.java, when the Balancer determines that it should fetch more blocks, it calls getBlockList() and decrements blocksToReceive by the amount that call returns. It then restarts from the top of the loop.

        while (!isTimeUp && getScheduledSize() > 0 &&
            (!srcBlockList.isEmpty() || blocksToReceive > 0)) {

          ## SOME LINES OMITTED ##

          filterMovedBlocks(); // filter already moved blocks
          if (shouldFetchMoreBlocks()) {
            // fetch new blocks
            try {
              blocksToReceive -= getBlockList();
              continue;
            } catch (IOException e) {

          ## SOME LINES OMITTED ##

          // check if time is up or not
          if (Time.now() - startTime > MAX_ITERATION_TIME) {
            isTimeUp = true;
            continue;
          }

          ## SOME LINES OMITTED ##

        }
        

        The problem is that if the datanode is decommissioned, getBlockList() returns nothing, so blocksToReceive does not change. Because the fetch branch ends in a continue, control jumps back to the top of the loop without ever reaching the time check further down, so isTimeUp is never set to true. With blocksToReceive still greater than 0, the loop repeats indefinitely. The submitted patch moves the time-up check to the top of the loop body, so each iteration checks the elapsed time first and proceeds only if the time limit has not been reached.
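
The effect of moving the check can be illustrated with a minimal, self-contained sketch. This is not the committed code: the class DispatchLoopSketch, the method dispatch, the fetchBlocks and clock parameters, and the value of MAX_ITERATION_TIME are all hypothetical stand-ins, with fetchBlocks playing the role of getBlockList() and clock the role of Time.now(). It only assumes, as described above, that a fetch against the unavailable datanode returns nothing.

```java
import java.util.function.IntSupplier;
import java.util.function.LongSupplier;

/** Hypothetical, simplified stand-in for the loop above; not the real Balancer code. */
public class DispatchLoopSketch {
  static final long MAX_ITERATION_TIME = 1_000L; // illustrative value, in ms

  /**
   * Runs the simplified loop and returns the number of fetch attempts made.
   * fetchBlocks stands in for getBlockList(); clock stands in for Time.now().
   */
  static int dispatch(long blocksToReceive, IntSupplier fetchBlocks, LongSupplier clock) {
    long startTime = clock.getAsLong();
    boolean isTimeUp = false;
    int fetchAttempts = 0;
    while (!isTimeUp && blocksToReceive > 0) {
      // Time check comes first, so no later 'continue' can skip it.
      if (clock.getAsLong() - startTime > MAX_ITERATION_TIME) {
        isTimeUp = true;
        continue;
      }
      fetchAttempts++;
      // Assumed to yield 0 when the datanode is unavailable.
      blocksToReceive -= fetchBlocks.getAsInt();
      continue; // with the check placed below this point, it would never run
    }
    return fetchAttempts;
  }

  public static void main(String[] args) {
    long[] now = {0L};
    // Fake clock that advances 100 ms per reading; the fetch never returns anything.
    int attempts = dispatch(10, () -> 0, () -> now[0] += 100);
    System.out.println("fetch attempts before timeout: " + attempts);
  }
}
```

With the time check at the top, the loop ends once the fake clock passes the limit even though blocksToReceive never decreases. Injecting the clock and the fetch call is one possible way to exercise this behavior in a unit test without a real cluster.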

        szetszwo Tsz Wo Nicholas Sze added a comment -

        Mit Desai, sorry for the late review. The patch looks good. Could you update the patch with current trunk?

        mitdesai Mit Desai added a comment -

        Attached the modified patch. I still do not have a unit test for the fix.

        szetszwo Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote  Subsystem        Runtime   Comment
        -1    pre-patch        16m 21s   Findbugs (version ) appears to be broken on trunk.
        +1    @author          0m 0s     The patch does not contain any @author tags.
        -1    tests included   0m 0s     The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1    javac            7m 58s    There were no new javac warning messages.
        +1    javadoc          9m 55s    There were no new javadoc warning messages.
        +1    release audit    0m 23s    The applied patch does not increase the total number of release audit warnings.
        +1    checkstyle       0m 31s    There were no new checkstyle issues.
        +1    whitespace       0m 0s     The patch has no lines that end in whitespace.
        +1    install          1m 32s    mvn install still works.
        +1    eclipse:eclipse  0m 34s    The patch built with eclipse:eclipse.
        +1    findbugs         2m 34s    The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1    native           3m 15s    Pre-build of native portion
        -1    hdfs tests       161m 5s   Tests failed in hadoop-hdfs.
                               204m 12s



        Reason             Tests
        Failed unit tests  hadoop.hdfs.TestDFSUpgradeFromImage
                           hadoop.fs.TestSymlinkHdfsFileSystem
                           hadoop.fs.viewfs.TestViewFileSystemHdfs



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12747558/HDFS-742-trunk.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / 88d8736
        hadoop-hdfs test log https://builds.apache.org/job/PreCommit-HDFS-Build/11875/artifact/patchprocess/testrun_hadoop-hdfs.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/11875/testReport/
        Java 1.7.0_55
        uname Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/11875/console

        This message was automatically generated.

        kihwal Kihwal Lee added a comment -

        This somehow didn't get committed. Rebasing to trunk.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem   Runtime  Comment
        0     reexec      0m 15s   Docker mode activated.
        +1    @author     0m 0s    The patch does not contain any @author tags.
        -1    test4tests  0m 0s    The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1    mvninstall  7m 9s    trunk passed
        +1    compile     0m 48s   trunk passed
        +1    checkstyle  0m 27s   trunk passed
        +1    mvnsite     0m 52s   trunk passed
        +1    mvneclipse  0m 13s   trunk passed
        +1    findbugs    1m 45s   trunk passed
        +1    javadoc     0m 57s   trunk passed
        +1    mvninstall  0m 50s   the patch passed
        +1    compile     0m 46s   the patch passed
        +1    javac       0m 46s   the patch passed
        +1    checkstyle  0m 24s   the patch passed
        +1    mvnsite     0m 55s   the patch passed
        +1    mvneclipse  0m 10s   the patch passed
        +1    whitespace  0m 0s    The patch has no whitespace issues.
        +1    findbugs    1m 48s   the patch passed
        +1    javadoc     0m 53s   the patch passed
        -1    unit        61m 19s  hadoop-hdfs in the patch failed.
        +1    asflicense  0m 18s   The patch does not generate ASF License warnings.
                          81m 0s



        Reason              Tests
        Failed junit tests  hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821694/HDFS-742.v1.trunk.patch
        JIRA Issue HDFS-742
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 882346dd1655 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 954465e
        Default Java 1.8.0_101
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/16296/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/16296/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/16296/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        kihwal Kihwal Lee added a comment -

        The test failure is due to HDFS-9781.

        daryn Daryn Sharp added a comment -

        +1 A triple-digit jira!

        kihwal Kihwal Lee added a comment -

        Thanks for working on this, Mit Desai. I've committed this to trunk, branch-2 and branch-2.8.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #10203 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10203/)
        HDFS-742. A down DataNode makes Balancer to hang on repeatingly asking (kihwal: rev 58db263e93daf08280e6a586a10cebd6122cf72a)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Dispatcher.java

          People

          • Assignee:
            mitdesai Mit Desai
            Reporter:
            hairong Hairong Kuang
          • Votes:
            0
            Watchers:
            12
