Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0-alpha1
    • Component/s: datanode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Filing this issue in response to "full disk woes" on hdfs-user.

      Datanodes fill their storage directories unevenly, leading to situations where certain disks are full while others are significantly less used. Users at many different sites have experienced this issue, and HDFS administrators are taking steps like:

      • Manually rebalancing blocks in storage directories
      • Decommissioning nodes and later re-adding them

      There's a tradeoff between making use of all available spindles and filling disks at roughly the same rate. Possible solutions include:

      • Weighting less-used disks more heavily when placing new blocks on the datanode. In write-heavy environments this will still make use of all spindles while equalizing disk use over time (see the sketch at the end of this description).
      • Rebalancing blocks locally. This would help equalize disk use as disks are added/replaced in older cluster nodes.

      Datanodes should actively manage their local disks so that operator intervention is not needed.
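
      As an illustration of the first option above (a minimal sketch, not the implementation attached to this issue; the Volume interface is a hypothetical stand-in for a storage directory), block placement could pick a volume with probability proportional to its free space:

      import java.util.List;
      import java.util.Random;

      /**
       * Illustrative sketch only: choose a storage directory with probability
       * proportional to its free space, so emptier disks receive new blocks
       * more often and disk usage converges over time.
       */
      class FreeSpaceWeightedChooser {

        /** Hypothetical view of a datanode storage directory. */
        interface Volume {
          long getAvailable(); // free bytes on this volume
        }

        private final Random random = new Random();

        Volume choose(List<? extends Volume> volumes) {
          long totalFree = 0;
          for (Volume v : volumes) {
            totalFree += v.getAvailable();
          }
          if (totalFree <= 0) {
            throw new IllegalStateException("No free space on any volume");
          }
          // Sample a point in [0, totalFree) and walk the volumes until the
          // cumulative free space passes it.
          long target = (long) (random.nextDouble() * totalFree);
          long cumulative = 0;
          for (Volume v : volumes) {
            cumulative += v.getAvailable();
            if (target < cumulative) {
              return v;
            }
          }
          return volumes.get(volumes.size() - 1); // guard against rounding
        }
      }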

      1. HDFS-1312.007.patch
        744 kB
        Arpit Agarwal
      2. HDFS-1312.006.patch
        743 kB
        Anu Engineer
      3. HDFS-1312.005.patch
        743 kB
        Arpit Agarwal
      4. HDFS-1312.004.patch
        743 kB
        Anu Engineer
      5. HDFS-1312.003.patch
        738 kB
        Arpit Agarwal
      6. HDFS-1312.002.patch
        739 kB
        Anu Engineer
      7. HDFS-1312.001.patch
        739 kB
        Anu Engineer
      8. Architecture_and_test_update.pdf
        273 kB
        Anu Engineer
      9. Architecture_and_testplan.pdf
        218 kB
        Anu Engineer
      10. disk-balancer-proposal.pdf
        328 kB
        Anu Engineer

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Arpit Agarwal added a comment -

          The checkstyle failures were 'hides a field' warnings and one long method that was not introduced by this patch.

          I've merged the HDFS-1312 feature branch to trunk. Thanks for the code contributions, Anu Engineer, Xiaobing Zhou, Lei (Eddy) Xu and Yiqun Lin. Thanks to everyone else who contributed ideas and feedback on this long-running JIRA. Users frequently request this feature and it felt good to commit it.

          Anu or I will resolve this JIRA shortly and move the remaining sub-tasks out to a follow-up JIRA.

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #10014 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10014/)
          HDFS-8821. Stop tracking CHANGES.txt in the HDFS-1312 feature branch. (arp: rev 4b93ddae07ba4332f40f896542ee2c6d7bf899ed)

          • hadoop-hdfs-project/hadoop-hdfs/pom.xml
          • hadoop-hdfs-project/hadoop-hdfs/HDFS-1312_CHANGES.txt
            Fix a build break in HDFS-1312 (arp: rev 9be9703716d2787cd6ee0ebbbe44a18b1f039018)
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/diskbalancer/DiskBalancerException.java
          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 23s Docker mode activated.
          0 shelldocs 0m 0s Shelldocs was not available.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 15 new or modified test files.
          0 mvndep 0m 16s Maven dependency ordering for branch
          +1 mvninstall 6m 18s trunk passed
          +1 compile 1m 26s trunk passed
          +1 checkstyle 0m 47s trunk passed
          +1 mvnsite 1m 30s trunk passed
          +1 mvneclipse 0m 29s trunk passed
          +1 findbugs 3m 9s trunk passed
          +1 javadoc 1m 20s trunk passed
          0 mvndep 0m 8s Maven dependency ordering for patch
          +1 mvninstall 1m 20s the patch passed
          +1 compile 1m 23s the patch passed
          +1 cc 1m 23s the patch passed
          +1 javac 1m 23s the patch passed
          -0 checkstyle 0m 45s hadoop-hdfs-project: The patch generated 19 new + 1000 unchanged - 2 fixed = 1019 total (was 1002)
          +1 mvnsite 1m 25s the patch passed
          +1 mvneclipse 0m 23s the patch passed
          +1 shellcheck 0m 12s The patch generated 0 new + 75 unchanged - 1 fixed = 75 total (was 76)
          +1 whitespace 0m 1s The patch has no whitespace issues.
          +1 xml 0m 2s The patch has no ill-formed XML file.
          +1 findbugs 3m 21s the patch passed
          +1 javadoc 1m 14s the patch passed
          +1 unit 0m 54s hadoop-hdfs-client in the patch passed.
          -1 unit 72m 29s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 24s The patch does not generate ASF License warnings.
          101m 22s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency
            hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:85209cc
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12812975/HDFS-1312.007.patch
          JIRA Issue HDFS-1312
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle cc xml shellcheck shelldocs
          uname Linux 14ef8c22f1d7 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 0b9edf6
          Default Java 1.8.0_91
          shellcheck v0.4.4
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15900/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15900/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15900/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15900/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Arpit Agarwal added a comment -

          v7 - patch for Jenkins run after rebasing the feature branch on top of trunk.

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 12m 16s Docker mode activated.
          0 shelldocs 0m 1s Shelldocs was not available.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 15 new or modified test files.
          0 mvndep 0m 42s Maven dependency ordering for branch
          +1 mvninstall 7m 39s trunk passed
          +1 compile 1m 41s trunk passed
          +1 checkstyle 0m 50s trunk passed
          +1 mvnsite 1m 50s trunk passed
          +1 mvneclipse 0m 30s trunk passed
          +1 findbugs 3m 43s trunk passed
          +1 javadoc 1m 30s trunk passed
          0 mvndep 0m 9s Maven dependency ordering for patch
          +1 mvninstall 1m 39s the patch passed
          +1 compile 1m 37s the patch passed
          +1 cc 1m 37s the patch passed
          +1 javac 1m 37s the patch passed
          -0 checkstyle 0m 50s hadoop-hdfs-project: The patch generated 20 new + 1000 unchanged - 2 fixed = 1020 total (was 1002)
          +1 mvnsite 1m 39s the patch passed
          +1 mvneclipse 0m 26s the patch passed
          +1 shellcheck 0m 14s The patch generated 0 new + 75 unchanged - 1 fixed = 75 total (was 76)
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 xml 0m 3s The patch has no ill-formed XML file.
          +1 findbugs 4m 3s the patch passed
          +1 javadoc 1m 20s the patch passed
          +1 unit 1m 4s hadoop-hdfs-client in the patch passed.
          -1 unit 61m 0s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 20s The patch does not generate ASF License warnings.
          107m 0s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestEditLog
            hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:85209cc
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12812949/HDFS-1312.006.patch
          JIRA Issue HDFS-1312
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle cc xml shellcheck shelldocs
          uname Linux c0410e1298a2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / dca298d
          Default Java 1.8.0_91
          shellcheck v0.4.4
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15895/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15895/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15895/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15895/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Anu Engineer added a comment -

          Updating the patch against trunk. Hopefully the last patch before merge.

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 16m 1s Docker mode activated.
          0 shelldocs 0m 0s Shelldocs was not available.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 15 new or modified test files.
          0 mvndep 0m 18s Maven dependency ordering for branch
          +1 mvninstall 8m 21s trunk passed
          +1 compile 1m 58s trunk passed
          +1 checkstyle 0m 50s trunk passed
          +1 mvnsite 1m 57s trunk passed
          +1 mvneclipse 0m 33s trunk passed
          +1 findbugs 3m 19s trunk passed
          +1 javadoc 1m 25s trunk passed
          0 mvndep 0m 9s Maven dependency ordering for patch
          +1 mvninstall 1m 29s the patch passed
          +1 compile 1m 30s the patch passed
          +1 cc 1m 30s the patch passed
          +1 javac 1m 30s the patch passed
          -0 checkstyle 0m 49s hadoop-hdfs-project: The patch generated 19 new + 1000 unchanged - 2 fixed = 1019 total (was 1002)
          +1 mvnsite 1m 31s the patch passed
          +1 mvneclipse 0m 24s the patch passed
          +1 shellcheck 0m 13s The patch generated 0 new + 75 unchanged - 1 fixed = 75 total (was 76)
          +1 whitespace 0m 1s The patch has no whitespace issues.
          +1 xml 0m 2s The patch has no ill-formed XML file.
          +1 findbugs 3m 26s the patch passed
          +1 javadoc 1m 20s the patch passed
          +1 unit 0m 57s hadoop-hdfs-client in the patch passed.
          -1 unit 75m 18s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 24s The patch does not generate ASF License warnings.
          124m 5s



          Reason Tests
          Failed junit tests hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer
            hadoop.hdfs.server.blockmanagement.TestBlockManager
            hadoop.hdfs.server.balancer.TestBalancer
            hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand
            hadoop.hdfs.server.namenode.TestDecommissioningStatus



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:85209cc
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12812894/HDFS-1312.005.patch
          JIRA Issue HDFS-1312
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle cc xml shellcheck shelldocs
          uname Linux 0d336b5416e0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / e98c0c7
          Default Java 1.8.0_91
          shellcheck v0.4.4
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15892/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15892/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15892/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15892/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Anu Engineer added a comment -

          Posting another merge patch against trunk.

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 21s Docker mode activated.
          0 shelldocs 0m 0s Shelldocs was not available.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 15 new or modified test files.
          0 mvndep 0m 13s Maven dependency ordering for branch
          +1 mvninstall 6m 38s trunk passed
          +1 compile 1m 28s trunk passed
          +1 checkstyle 0m 48s trunk passed
          +1 mvnsite 1m 30s trunk passed
          +1 mvneclipse 0m 30s trunk passed
          +1 findbugs 3m 14s trunk passed
          +1 javadoc 1m 20s trunk passed
          0 mvndep 0m 8s Maven dependency ordering for patch
          +1 mvninstall 1m 21s the patch passed
          +1 compile 1m 23s the patch passed
          +1 cc 1m 23s the patch passed
          +1 javac 1m 23s the patch passed
          -0 checkstyle 0m 43s hadoop-hdfs-project: The patch generated 19 new + 1000 unchanged - 2 fixed = 1019 total (was 1002)
          +1 mvnsite 1m 31s the patch passed
          +1 mvneclipse 0m 23s the patch passed
          +1 shellcheck 0m 14s The patch generated 0 new + 75 unchanged - 1 fixed = 75 total (was 76)
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 xml 0m 2s The patch has no ill-formed XML file.
          +1 findbugs 3m 40s the patch passed
          +1 javadoc 1m 16s the patch passed
          +1 unit 0m 59s hadoop-hdfs-client in the patch passed.
          -1 unit 60m 17s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 25s The patch does not generate ASF License warnings.
          90m 13s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
            hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer
            hadoop.tools.TestHdfsConfigFields
            hadoop.hdfs.server.namenode.TestCacheDirectives
            hadoop.hdfs.server.namenode.TestNamenodeRetryCache



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:e2f6409
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12812506/HDFS-1312.003.patch
          JIRA Issue HDFS-1312
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle cc xml shellcheck shelldocs
          uname Linux 08332ffa6c3b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 4ee3543
          Default Java 1.8.0_91
          shellcheck v0.4.4
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15873/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15873/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15873/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15873/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 29s Docker mode activated.
          0 shelldocs 0m 1s Shelldocs was not available.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 15 new or modified test files.
          0 mvndep 0m 10s Maven dependency ordering for branch
          +1 mvninstall 9m 29s trunk passed
          +1 compile 2m 1s trunk passed
          +1 checkstyle 0m 55s trunk passed
          +1 mvnsite 2m 13s trunk passed
          +1 mvneclipse 0m 34s trunk passed
          +1 findbugs 4m 15s trunk passed
          +1 javadoc 1m 34s trunk passed
          0 mvndep 0m 8s Maven dependency ordering for patch
          +1 mvninstall 1m 36s the patch passed
          +1 compile 1m 52s the patch passed
          +1 cc 1m 52s the patch passed
          +1 javac 1m 52s the patch passed
          -1 checkstyle 0m 47s hadoop-hdfs-project: The patch generated 18 new + 1001 unchanged - 2 fixed = 1019 total (was 1003)
          +1 mvnsite 1m 34s the patch passed
          +1 mvneclipse 0m 24s the patch passed
          +1 shellcheck 0m 15s The patch generated 0 new + 75 unchanged - 1 fixed = 75 total (was 76)
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 xml 0m 1s The patch has no ill-formed XML file.
          +1 findbugs 3m 56s the patch passed
          +1 javadoc 1m 22s the patch passed
          +1 unit 1m 1s hadoop-hdfs-client in the patch passed.
          -1 unit 62m 42s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 29s The patch does not generate ASF License warnings.
          99m 40s



          Reason Tests
          Failed junit tests hadoop.tools.TestHdfsConfigFields
            hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:e2f6409
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12811864/HDFS-1312.002.patch
          JIRA Issue HDFS-1312
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle cc xml shellcheck shelldocs
          uname Linux 2b8e9ba71dec 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 5107a96
          Default Java 1.8.0_91
          shellcheck v0.4.4
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15833/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15833/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15833/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15833/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15833/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Anu Engineer added a comment -
          • Fixed a set of issues reported by Lei (Eddy) Xu
           • Fixed whitespace issues in documentation
          Anu Engineer added a comment -

          Allen Wittenauer Thanks, will do.

          Allen Wittenauer added a comment -

          White Space warning is due to doc changes.

          You should still remove them.

          Anu Engineer added a comment -
           • Checkstyle issues are mostly 'variable hides a field'.
           • White Space warning is due to doc changes.
           • Test failures are not related to this patch.
          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          0 shelldocs 0m 1s Shelldocs was not available.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 15 new or modified test files.
          0 mvndep 0m 21s Maven dependency ordering for branch
          +1 mvninstall 8m 25s trunk passed
          +1 compile 1m 49s trunk passed
          +1 checkstyle 0m 48s trunk passed
          +1 mvnsite 1m 37s trunk passed
          +1 mvneclipse 0m 31s trunk passed
          +1 findbugs 3m 27s trunk passed
          +1 javadoc 1m 22s trunk passed
          0 mvndep 0m 8s Maven dependency ordering for patch
          +1 mvninstall 1m 25s the patch passed
          +1 compile 1m 30s the patch passed
          +1 cc 1m 30s the patch passed
          +1 javac 1m 30s the patch passed
          -1 checkstyle 0m 44s hadoop-hdfs-project: The patch generated 19 new + 1000 unchanged - 2 fixed = 1019 total (was 1002)
          +1 mvnsite 1m 28s the patch passed
          +1 mvneclipse 0m 22s the patch passed
          +1 shellcheck 0m 13s The patch generated 0 new + 75 unchanged - 1 fixed = 75 total (was 76)
          -1 whitespace 0m 0s The patch has 10 line(s) with tabs.
          +1 xml 0m 2s The patch has no ill-formed XML file.
          +1 findbugs 3m 34s the patch passed
          +1 javadoc 1m 15s the patch passed
          +1 unit 0m 55s hadoop-hdfs-client in the patch passed.
          -1 unit 60m 24s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 26s The patch does not generate ASF License warnings.
          92m 51s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency
            hadoop.tools.TestHdfsConfigFields



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:e2f6409
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12810943/HDFS-1312.001.patch
          JIRA Issue HDFS-1312
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle cc xml shellcheck shelldocs
          uname Linux cf6241f0ffa9 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 2800695
          Default Java 1.8.0_91
          shellcheck v0.4.4
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15810/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/15810/artifact/patchprocess/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15810/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15810/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15810/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15810/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Anu Engineer added a comment -

          Posting the merge patch for diskbalancer against trunk

          Anu Engineer added a comment -

          Update to disk balancer arch and test plan

          Anu Engineer added a comment -

          Notes from the call on Jan 14th, 2016

          Attendees: Andrew Wang, Lei Xu, Colin McCabe, Chris Trezzo, Ming Ma, Arpit Agarwal, Jitendra Pandey, Jing Zhao, Mingliang Liu, Xiaobing Zhou, Anu Engineer, and others who dialed in (I could only see phone numbers, not names; my apologies to anyone I am missing).

          We discussed the goals of HDFS-1312. Andrew Wang mentioned that HDFS-1804 is used by many customers, is safe, and has been used in production for a while. Jitendra pointed out that we still have many customers who are not using HDFS-1804, so he suggested that we focus the discussion on HDFS-1312. We explored the pros and cons of having the planner completely inside the datanode, as well as various other user scenarios. As a team we wanted to make sure that all major scenarios are identified and covered in this review.

          Ming Ma raised an interesting question, which we decided to address: he wanted to find out whether running the disk balancer has any quantifiable performance impact. Anu mentioned that since we have bandwidth control, admins should be able to control it. However, any disk I/O has a cost, so we decided to do some performance measurement of the disk balancer.

          Andrew Wang raised the question of performance counters and how external tools like Cloudera Manager or Ambari would use the disk balancer. He also explored how we will be able to integrate this tool with other management tools. We agreed that we will have a set of performance counters exposed via datanode JMX. We also discussed the design trade-offs of doing the disk balancer inside the datanode vs. outside. We reviewed lots of administrative scenarios and concluded that this tool would be able to address them. We also concluded that the tool does not do any cluster-wide planning and that all data movement is confined to the datanode.

          Colin McCabe brought up a set of interesting questions. He made us think through the scenario of data changing on the datanodes while the disk balancer is operational, and the impact of future disks with shingled magnetic recording and large disk sizes. He wondered how long it would take to balance a datanode filled with 6 TB or even 20 TB drives. The conclusion was that if you had large, slow disks and lots of data on a node, it would take proportionally more time. For the question of data changing on the datanodes, the disk balancer would support a tolerance value, or "good enough" value, for balancing. That is, an administrator can specify that getting within 10% of the expected data distribution is good enough. We also discussed a scenario called "Hot Remove": just like hot swap, small cluster owners might find it useful to move all data off a hard disk before removing it, say to upgrade to a larger size.
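
          As a minimal illustration of the tolerance idea (hypothetical names, not code from any of the patches attached here), the check amounts to comparing a volume's used ratio against the node's ideal used ratio:

          /**
           * Illustrative sketch: a volume is "good enough" once its used ratio is
           * within the admin-supplied tolerance of the node's ideal used ratio.
           */
          final class ToleranceCheck {
            static boolean isWithinTolerance(long volumeUsed, long volumeCapacity,
                double idealUsedRatio, double tolerancePercent) {
              double actualRatio = (double) volumeUsed / volumeCapacity;
              // tolerancePercent = 10 means +/- 0.10 around the ideal ratio
              return Math.abs(actualRatio - idealUsedRatio) <= tolerancePercent / 100.0;
            }
          }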

          Ming Ma pointed out that for them it is easier and simpler to decommission a node; if you have a large number of nodes, relying on the network is more efficient than micro-managing a datanode. We agreed with that, but for small cluster owners (say, fewer than 5 or 10 nodes) it might make sense to support the ability to move data off a disk. Anu pointed out that the disk balancer design does accommodate that capability even though it is not the primary goal of the tool.

          Ming Ma also brought up how Twitter runs the balancer tool today: it is always running against Twitter clusters. We discussed whether having that balancer as a part of the namenode makes sense, but concluded that it was out of scope for HDFS-1312. Andrew mentioned that it is the right thing to do in the long run. We also discussed whether the disk balancer should trigger automatically instead of being an administrator-driven task; we were worried that it would trigger and incur I/O when higher-priority compute jobs were running in the cluster, so we decided we are better off letting an admin decide when it is a good time to run the disk balancer.

          At the end of the review, Andrew asked if we could finish this work by the end of next month and offered to help make sure that this feature is done sooner.

          Action Items:

          • Analyze performance impact of disk balancer.
           • Add a set of performance counters exposed via datanode JMX (see the sketch below).
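
           As a rough sketch of the second action item (the counter names and the JMX object name are hypothetical, not the metrics interface that was eventually committed):

           import java.lang.management.ManagementFactory;
           import javax.management.ObjectName;

           /** Hypothetical counter set a datanode could expose for the disk balancer. */
           interface DiskBalancerStatusMXBean {
             long getBytesMoved();   // bytes copied between volumes so far
             long getBlocksMoved();  // blocks copied between volumes so far
             long getErrorCount();   // failed block moves
           }

           /** Minimal sketch of publishing those counters via the platform MBean server. */
           class DiskBalancerStatus implements DiskBalancerStatusMXBean {
             private volatile long bytesMoved;
             private volatile long blocksMoved;
             private volatile long errorCount;

             public long getBytesMoved()  { return bytesMoved; }
             public long getBlocksMoved() { return blocksMoved; }
             public long getErrorCount()  { return errorCount; }

             void register() throws Exception {
               // External tools (e.g. Ambari) could poll these attributes once registered.
               ObjectName name = new ObjectName("Hadoop:service=DataNode,name=DiskBalancerStatus");
               ManagementFactory.getPlatformMBeanServer().registerMBean(this, name);
             }
           }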

          Please feel free to comment / correct these notes if I have missed anything. Thank you all for calling in and for having such a great and productive discussion about HDFS-1312.

          Chris Trezzo added a comment -

          I will dial into the call as well. Thanks for posting.

          Anu Engineer added a comment -

          Hi Andrew Wang,

          As discussed off-line, let us meet on 14th of Jan, 2016 @ 4:00 - 5:00 PM PST. Here is the meeting info.
          I look forward to chatting with other Apache members who might be interested in this topic.

          Anu Engineer is inviting you to a scheduled Zoom meeting.

          Topic: HDFS-1312 discussion
          Time: Jan 14, 2016 4:00 PM (GMT-8:00) Pacific Time (US and Canada) 
          
          Join from PC, Mac, Linux, iOS or Android: https://hortonworks.zoom.us/j/267578285
           
          Or join by phone:
          
              +1 646 558 8656 (US Toll) or +1 408 638 0968 (US Toll)
              +1 855 880 1246 (US Toll Free)
              +1 888 974 9888 (US Toll Free)
              Meeting ID: 267 578 285 
              International numbers available: https://hortonworks.zoom.us/zoomconference?m=ZlHRHTGmVEKXzM_RaCyzcSnjlk_z3ovm 
          
          Anu Engineer added a comment -

          Hi Andrew Wang,

          Thanks for your comments. Here are my thoughts on these issues.

          I don't follow this line of reasoning; don't concerns about using a new feature apply to a hypothetical HDFS-1312 implementation too?

          I think it is related to the risk. Let us look at the worst-case scenarios possible with HDFS-1804 and HDFS-1312. HDFS-1804 is a cluster-wide change; it is always on, and any write will always go through it. HDFS-1804 thus can have a cluster-wide impact, including an impact on various workloads in the cluster.

          However, with HDFS-1312 the worst case is that we take a node off-line, since it is an external tool that operates off-line on a node. Another important difference is that it is not always on; it does its work and goes away. So the amount of risk to the cluster, especially from an administrator's point of view, is different with these two approaches.

          Why do we lose this? Can't the DN dump this somewhere?

We can, but then we need to add RPCs to the datanode to pull out that data and display the change on the node, whereas in the current approach it is something we write to the local disk and then diff later against the sources. We don't need a datanode operation.

          This is an interesting point I was not aware of. Is the goal here to do inter-DN moving?

          No, the goal is intra-DN, I was referring to

           hdfs mover 

          not to

           hdfs balancer

          If it's only for intra-DN moving, then it could still live in the DN.

          Completely agree, all block moving code will be in DN.

          This is also why I brought up HDFS-8538. If HDFS-1804 is the default volume choosing policy, we won't see imbalance outside of hotswap.

          Agree, and it is a goal that we should work towards. From the comments in HDFS-8538, it looks like we might have to make some minor tweaks to that before we can commit it. I can look at it after HDFS-1312.

The point I was trying to make is that HDFS-1804 addresses the imbalance issues besides hotswap, so we eliminate the alerts in the first place. Hotswap is an operation explicitly undertaken by the admin, so the admin will know to also run the intra-DN balancer.

Since we have both made this point many times, I am going to agree with what you are saying. Even if we assume that hotswap or normal swap is the only use case for disk balancing, in a large cluster many disks will have failed, so when a cluster gets a number of disks replaced the current interface makes the admin's life easier: admins can replace a bunch of disks on various machines and ask the system to find and fix those nodes. I just think the interface we are building makes admins' lives easier, and takes nothing away from the use cases you described.

          This is an aspirational goal, but when debugging a prod cluster we almost certainly also want to see the DN log too

Right now we have actually met that aspirational goal: we capture a snapshot of the node, and that allows us to both debug and simulate what the disk balancer is doing off-line.

          Would it help to have a phone call about this? We have a lot of points flying around, might be easier to settle this via a higher-bandwidth medium.

I think that is an excellent idea; I would love to chat with you in person. I will set up a meeting and post the meeting info in this JIRA.

I really appreciate your input and the thoughtful discussion we are having; I hope to speak to you in person soon.

          Andrew Wang added a comment -

          Hi Anu, some replies:

Generally administrators are wary of enabling a feature like HDFS-1804 in a production cluster. For new clusters it is easier, but for existing production clusters assuming the existence of HDFS-1804 is not realistic.

          I don't follow this line of reasoning; don't concerns about using a new feature apply to a hypothetical HDFS-1312 implementation too?

          HDFS-1804 also was fixed in 2.1.0, so almost everyone should have it available. It's also been in use for years, so it's pretty stable.

we do lose one of the critical features of the tool, that is, the ability to report what we did to the machine

          Why do we lose this? Can't the DN dump this somewhere?

          We wanted to merge mover into this engine later...

          This is an interesting point I was not aware of. Is the goal here to do inter-DN moving? If so, we have a long-standing issue with inter-DN balancing, which is that the balancer as an external process is not aware of the NN's block placement policies, leading to placement violations. This is something Ming Ma and Chris Trezzo brought up; if we're doing a rewrite of this functionality, it should probably be in the NN.

          If it's only for intra-DN moving, then it could still live in the DN.

Two issues with that: one, there are lots of customers without HDFS-1804, and HDFS-1804 is just an option that a user can choose.

          Almost everyone is running a version of HDFS with HDFS-1804 these days. As I said in my previous comment, if a cluster is commonly hitting imbalance, enabling HDFS-1804 should be the first step since a) it's already available and b) it avoids the imbalance in the first place, which better conserves IO bandwidth.

          This is also why I brought up HDFS-8538. If HDFS-1804 is the default volume choosing policy, we won't see imbalance outside of hotswap.

          Getting an alert due to low space on disk from datanode is very reactive.... it is common enough problem that I think it should be solved at HDFS level.

The point I was trying to make is that HDFS-1804 addresses the imbalance issues besides hotswap, so we eliminate the alerts in the first place. Hotswap is an operation explicitly undertaken by the admin, so the admin will know to also run the intra-DN balancer. There's no monitoring system in the loop.

          I prefer to debug by looking at my local directory instead of ssh-ing into a datanode...

          This is an aspirational goal, but when debugging a prod cluster we almost certainly also want to see the DN log too, which is local to the DN. Cluster management systems also make log collection pretty easy, so this seems minor.

          Would it help to have a phone call about this? We have a lot of points flying around, might be easier to settle this via a higher-bandwidth medium.

          Anu Engineer added a comment -

Hi Andrew Wang Thanks for the quick response. I was thinking about what you said, and I think the disconnect we are having is that you are assuming HDFS-1804 is always available on HDFS clusters, but for many customers that is not true.

          Have these customers tried out HDFS-1804? We've had users complain about imbalance before too, and after enabling HDFS-1804, no further issues.

Generally administrators are wary of enabling a feature like HDFS-1804 in a production cluster. For new clusters it is easier, but for existing production clusters assuming the existence of HDFS-1804 is not realistic.

          If you're interested, feel free to take HDFS-8538 from me. I really think it'll fix the majority of imbalance issues outside of hotswap.

Thank you for the offer. I can certainly work on HDFS-8538 after HDFS-1312. I think of HDFS-1804, HDFS-8538 and HDFS-1312 as parts of the solution to the same problem, just attacking it from different angles; without all three, some part of the HDFS user base will always be left out.

          When I mentioned removing the discover phase, I meant the NN communication. Here, the DN just probes its own volume information. Does it need to talk to the NN for anything else?

1. I am not able to see any particular advantage in doing the planning inside the datanode. However with that approach we do lose one of the critical features of the tool, that is, the ability to report what we did to the machine; capturing the "before" state is much more complex. I agree that users can manually capture this info before and after, but that is an extra administrative burden. With the current approach, we record the information of a datanode beforehand and make it easy to compare the state once we are done.
2. Also, we wanted to merge the mover into this engine later. With the current approach, what we build inside the datanode is a simple block mover, or to be precise an RPC interface to the existing mover's block interface. You can feed it any move commands you like, which provides better composability. With what you are suggesting we would lose that flexibility.
3. Less complexity inside the datanode: the planner never needs to run inside the datanode. It is a piece of code that plans a set of moves, so why would we want to run it inside the datanode?
4. Since none of our tools currently report disk-level data distribution, it is not possible to find which nodes are imbalanced without talking to the Namenode. I know you are arguing that imbalance will never happen if all customers use HDFS-1804. Two issues with that: one, there are lots of customers without HDFS-1804, and HDFS-1804 is just an option that a user can choose. Since it is configurable, it is arguable that we will always have customers without HDFS-1804. The current architecture addresses the needs of both groups of users; in that sense it is a better, more encompassing architecture.

          Cluster-wide disk information is already handled by monitoring tools, no? The admin gets the Ganglia alert saying some node is imbalanced, admin triggers intranode balancer, admin keeps looking at Ganglia to see if it's fixed. I don't think adding our own monitoring of the same information helps, when Ganglia etc. are already available, in-use, and understood by admins.

Please correct me if I am missing something here. Getting an alert due to low disk space from a datanode is very reactive. With the current disk balancer design, we are assuming that the disk balancer tool can be run to address and fix any such issue in the cluster. You could argue that admins can write scripts to monitor this using the Ganglia command line, but it is a common enough problem that I think it should be solved at the HDFS level.

Here are two use cases that the disk balancer addresses: the first is discovering nodes with potential issues, and the second is automatically fixing those issues. This is very similar to the current balancer.

          1. Scenario 1: Admin can run

           hdfs diskbalancer -top 100 

, and voilà! We print out the top 100 nodes that are having a problem. Let us say the admin now wants to look closely at a node and find the distribution on its individual disks; he can now do that via the disk balancer, ssh, or Ganglia.

2. Scenario 2: The admin does not want to be bothered with this balancing act at all; in reality he is thinking, why doesn't HDFS just take care of this? (I know HDFS-1804 is addressing that, but again we are talking about a cluster that does not have it enabled.) In that case we let the admin run

           hdfs diskbalancer -top 10 -balance  

, which allows the admin to run the disk balancer just like the current balancer, without having to worry about what is happening or measure each node. With Ganglia, a bunch of nodes will fire alerts and the admin needs to copy the address of each datanode and give it to the disk balancer. I think the current flow of the disk balancer makes it easier to use.

I don't think this conflicts with the debuggability goal. The DN can dump the Info object (and even the Plan object) if requested, to the log or somewhere in a data dir.

Well, it is debuggable, but assuming that I am the one who will be called on to debug this, I prefer to debug by looking at my local directory instead of ssh-ing into a datanode. I think of writing to the local directory as a gift to my future self. Plus, as mentioned earlier, there is the other use case where we want to report to the user what our tool did, and fetching this data out of the datanode's log directory is hard (maybe another RPC to fetch it?).

          Adding a note that says "we use the existing moveBlockAcrossStorage method" is a great answer.

          I will update the design doc with this info. Thanks for your suggestions.

          Andrew Wang added a comment -

          Hi Anu, thanks for the reply,

          Our own experience from the field is that many customers routinely run into this issue, with and without new drives being added, so we are tackling both use cases.

Have these customers tried out HDFS-1804? We've had users complain about imbalance before too, and after enabling HDFS-1804, no further issues. This is why I'm trying to separate the two use cases; heterogeneous disks are better addressed by HDFS-1804 since it works automatically, leaving hotswap to this JIRA (HDFS-1312).

          If you're interested, feel free to take HDFS-8538 from me. I really think it'll fix the majority of imbalance issues outside of hotswap.

          (self-quote) I think most of this functionality should live in the DN since it's better equipped to do IO throttling and mutual exclusion.

          I think I was too brief before, let me expand a little. I imagine basically everything (discover, planning, execute) happening in the DN:

          1. Client sends RPC to DN telling it to balance with some parameters
          2. DN examines its volumes, constructs some Info object to hold the data
          3. DN thread calls Planner passing the Info object, which outputs a Plan
          4. Plan is queued at DN executor pool, which does the moves

          Attempting to address your points one by one:

          1. When I mentioned removing the discover phase, I meant the NN communication. Here, the DN just probes its own volume information. Does it need to talk to the NN for anything else?
          2. Assuming no need for NN communication, there's no new code. The new RPCs would be one to start balancing and one to monitor ongoing balancing, the rest of the communication happens between DN threads.
          3. Cluster-wide disk information is already handled by monitoring tools, no? The admin gets the Ganglia alert saying some node is imbalanced, admin triggers intranode balancer, admin keeps looking at Ganglia to see if it's fixed. I don't think adding our own monitoring of the same information helps, when Ganglia etc. are already available, in-use, and understood by admins.
4. I don't think this conflicts with the debuggability goal. The DN can dump the Info object (and even the Plan object) if requested, to the log or somewhere in a data dir. Then we can pass it into a unit test to debug. This unit test doesn't need to be a minicluster either; if we write the Planner correctly, the Info should encapsulate all the state and we're just running an algorithm to make a Plan. The planner being inside the DN doesn't change this.

          Thanks for the pointer to the mover logic, wasn't aware we had that. I asked about this since the proposal doc in 4.2 says "copy block from A to B and verify". Adding a note that says "we use the existing moveBlockAcrossStorage method" is a great answer.

          Anu Engineer added a comment -

          Andrew Wang I really appreciate you taking time out to read the proposal and provide such detailed feedback.

          I really like separating planning from execution, since it'll make the unit tests actual unit tests (no minicluster!). This is a real issue with the existing balancer, the tests take forever to run and don't converge reliably.

          Thank you, that was the intent.

          Section 2: HDFS-1804 has been around for a while and used successfully by many of our users, so "lack of real world data or adoption" is not entirely correct. We've even considered making it the default, see HDFS-8538 where the consensus was that we could do this if we add some additional throttling.

Thanks for correcting me and for the reference to HDFS-8538. I know folks at Cloudera write some excellent engineering blog posts and papers; I would love to read about your experiences with HDFS-1804.

IMO the use case to focus on is the addition of fresh drives, particularly in the context of hotswap. I'm unconvinced that intra-node imbalance happens naturally when HDFS-1804 is enabled, and enabling HDFS-1804 is essential if a cluster is commonly suffering from intra-DN imbalance (e.g. from differently sized disks on a node). This means we should only see intra-node imbalance on admin actions like adding a new drive: a singular, administrator-triggered operation.

          Our own experience from the field is that many customers routinely run into this issue, with and without new drives being added, so we are tackling both use cases.

          However, I don't understand the need for cluster-wide reporting and orchestration.

Just to make sure we are on the same page: there is no cluster-wide orchestration. The cluster-wide reporting allows admins to see which nodes need disk balancing. Not all clusters are running HDFS-1804, and hence the ability to discover which machines need disk balancing is useful for a large set of customers.

          I think most of this functionality should live in the DN since it's better equipped to do IO throttling and mutual exclusion.

That is how it is; HDFS-1312 is not complete yet, and you will see that all the data movement is indeed inside the datanode.

          Namely, this avoids the Discover step, simplifying things. There's also no global planning step. What I'm envisioning is a user experience where admin just points it at a DN, like:

          hdfs balancer -volumes -datanode 1.2.3.4:50070
          

          [Took the liberty to re-order some of your comments since both of these are best answered together]

          Completely agree with the proposed command, in fact that is one of the commands that will be part of the tool. As you pointed out that is the simplest use case.

The discover phase exists for several reasons.

1. The discover phase is needed even if we only have to process a single node; we still have to read the information about that node. I understand you are pointing out that we may not need the info for the whole cluster; please see my next point.
2. The discover phase makes the coding simpler, since we are able to rely on the current balancer code in NameNodeConnector, and it avoids having to write any new code. For the approach you are suggesting we would have to add more RPCs to the datanode, and for a rarely used administrative tool that seemed like overkill when the balancer already provides a way for us to do it. If you look at the current code in HDFS-1312, you will see that discover is merely a pass-through to Balancer#NameNodeConnector.
3. With the discover approach, we can take a snapshot of the nodes and the cluster, which allows us to report to the admin what changes we made after the move. This is pretty useful. Since we have cluster-wide data, it also allows us to report to an administrator which nodes need his/her attention (this is just a sort and print). As I mentioned earlier, unfortunately a large number of customers still do not use HDFS-1804, and this is a beneficial feature for them.
4. Last, but most important from my personal point of view, this makes testing and error reporting for the disk balancer much easier. Let us say we find a bug in the disk balancer: a customer could just provide the cluster JSON discovered by the disk balancer, and we can debug the issue off-line. During the development phase, that is how I have been testing the disk balancer: by taking snapshots of real clusters and feeding that data back into the disk balancer via a connector called JsonNodeConnector.

          As I said earlier, the most generic use case I have in mind is the one you already described, where the user wants to just point the balancer at a datanode, and we will support that use case.

          There's also no global planning step, I think most of this functionality should live in the DN since it's better equipped to do IO throttling and mutual exclusion. Basically we'd send an RPC to tell the DN to balance itself (with parameters), and then poll another RPC to watch the status and wait until it's done.

We have to have a plan – either we build it inside the datanode, or we build it outside and submit it to the datanode. By doing it outside we are able to test the planning phase independently. Please look at the test cases in TestPlanner.java: we are able to test both the scale and the correctness of the planner easily. By moving that into the datanode, we run into the issue that we can only test a plan with a MiniDFSCluster, which we both agree is painful.

This also has the added benefit that an admin gets an opportunity to review the plan if needed, and it supports the other use case where you can use this tool as a block mover. Eventually we are hoping that this code will merge with the mover.

          On the topic of the actual balancing, how do we atomically move the block in the presence of failures? Right now the NN expects only one replica per DN, so if the same replica is on multiple volumes of the DN, we could run into issues. See related (quite serious) issues like HDFS-7443 and HDFS-7960. I think we can do some tricks with a temp filename and rename, but this procedure should be carefully explained.

Thanks for the pointers. We have very little new code here; we rely on the mover logic - in fact, what we have is a thin wrapper over FsDatasetSpi#moveBlockAcrossStorage. But the concerns you raise are quite valid, and I will test to make sure that we don't regress on the datanode side. We will cover all these scenarios and build test cases specifically to replicate the error conditions you are pointing out. Right now, all we do is take a lock at the FsDatasetSpi level (since we want mutual exclusion with the DirectoryScanner) and rely on moveBlockAcrossStorage. I will also add more details to Architecture_and_testplan.pdf if it is not well explained in that document.

          Andrew Wang added a comment -

          Hi Anu, thanks for picking up this JIRA, it's a long-standing issue. It's clear from the docs that you've put a lot of thought into the problem, and I hope that some of the same ideas could someday be carried over to the existing balancer too. I really like separating planning from execution, since it'll make the unit tests actual unit tests (no minicluster!). This is a real issue with the existing balancer, the tests take forever to run and don't converge reliably.

          I did have a few comments though, hoping we can get clarity on the usecases and some of the resulting design decisions:

          Section 2: HDFS-1804 has been around for a while and used successfully by many of our users, so "lack of real world data or adoption" is not entirely correct. We've even considered making it the default, see HDFS-8538 where the consensus was that we could do this if we add some additional throttling.

IMO the use case to focus on is the addition of fresh drives, particularly in the context of hotswap. I'm unconvinced that intra-node imbalance happens naturally when HDFS-1804 is enabled, and enabling HDFS-1804 is essential if a cluster is commonly suffering from intra-DN imbalance (e.g. from differently sized disks on a node). This means we should only see intra-node imbalance on admin actions like adding a new drive: a singular, administrator-triggered operation.

With that in mind, I wonder if we can limit the scope of this effort. I like the idea of an online balancer; the previous hacky scripts required downtime, which is unacceptable with hotswap. However, I don't understand the need for cluster-wide reporting and orchestration. With HDFS-1804, intra-node imbalance should only happen when an admin adds a new drive, and the admin can then know to trigger the intra-DN balancer when doing so. If they forget, Ganglia will light up and remind them.

          What I'm envisioning is a user experience where admin just points it at a DN, like:

          hdfs balancer -volumes -datanode 1.2.3.4:50070
          

          Namely, this avoids the Discover step, simplifying things. There's also no global planning step, I think most of this functionality should live in the DN since it's better equipped to do IO throttling and mutual exclusion. Basically we'd send an RPC to tell the DN to balance itself (with parameters), and then poll another RPC to watch the status and wait until it's done.

          On the topic of the actual balancing, how do we atomically move the block in the presence of failures? Right now the NN expects only one replica per DN, so if the same replica is on multiple volumes of the DN, we could run into issues. See related (quite serious) issues like HDFS-7443 and HDFS-7960. I think we can do some tricks with a temp filename and rename, but this procedure should be carefully explained.

          Thanks,
          Andrew

          Anu Engineer added a comment -

          Lei (Eddy) Xu Very pertinent questions. I will probably steal these questions and add them to the diskbalancer documentation.

Will the disk balancer be a daemon or a CLI tool that waits for the process to finish?

It is a combination of both. Let me explain in greater detail. We rely on datanodes to do the actual copy, and a datanode will run a background thread if there is a disk-balancing job to run. The disk balancer job is a list of statements, each containing a source volume, a destination volume and a number of bytes - in the code this is called a MoveStep (org.apache.hadoop.hdfs.server.diskbalancer.planner.MoveStep). The CLI reads the current state of the cluster and computes a set of MoveSteps, which is called a Plan. The advantage of this approach is that it allows the administrator to review the plan, if needed, before executing it against the datanode. The plan can also be persisted to a file, or simply submitted to a DataNode via the submitDiskbalancerPlan RPC - HDFS-9588.

The datanode takes this plan and executes the moves in the background, so there is no new daemon; the CLI tool is really used for planning and generating the data movement for each datanode. This is quite similar to the balancer, but instead of sending one RPC at a time we submit all of the moves together to the datanode, since our moves are pretty much self-contained within a datanode. To sum up, we have a CLI tool that submits a plan but does not wait for the plan to finish executing, and the datanode does the moves itself.
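To make the shape of a plan concrete, here is a minimal, self-contained Java sketch of the idea described above. The MoveStep and Plan classes and the greedy one-step pairing are illustrative assumptions invented for this sketch; the real classes live in org.apache.hadoop.hdfs.server.diskbalancer.planner and have richer APIs.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only, not the real HDFS classes.
final class Volume {
  final String path; final long capacity; final long used;
  Volume(String path, long capacity, long used) {
    this.path = path; this.capacity = capacity; this.used = used;
  }
  double usedRatio() { return (double) used / capacity; }
}

final class MoveStep {
  final String sourceVolume;       // e.g. "/data/disk1"
  final String destinationVolume;  // e.g. "/data/disk4"
  final long bytesToMove;
  MoveStep(String src, String dst, long bytes) {
    this.sourceVolume = src; this.destinationVolume = dst; this.bytesToMove = bytes;
  }
  @Override public String toString() {
    return sourceVolume + " -> " + destinationVolume + " : " + bytesToMove + " bytes";
  }
}

final class Plan {
  final List<MoveStep> steps = new ArrayList<>();
}

public class PlannerSketch {
  // Hypothetical one-step planner: move enough bytes from the fullest volume
  // toward the emptiest volume to bring the fullest one back to the node's mean.
  static Plan plan(List<Volume> volumes) {
    Plan plan = new Plan();
    Volume src = volumes.stream().max(Comparator.comparingDouble(Volume::usedRatio)).orElseThrow();
    Volume dst = volumes.stream().min(Comparator.comparingDouble(Volume::usedRatio)).orElseThrow();
    double mean = volumes.stream().mapToDouble(Volume::usedRatio).average().orElse(0);
    long bytes = (long) ((src.usedRatio() - mean) * src.capacity);
    if (bytes > 0) {
      plan.steps.add(new MoveStep(src.path, dst.path, bytes));
    }
    return plan;
  }

  public static void main(String[] args) {
    List<Volume> volumes = List.of(
        new Volume("/data/disk1", 4_000_000_000L, 3_600_000_000L),
        new Volume("/data/disk2", 4_000_000_000L, 1_000_000_000L));
    plan(volumes).steps.forEach(System.out::println);
  }
}

The point of the sketch is only that a Plan is plain data: it can be reviewed by an administrator, persisted to a file, or submitted over the RPC, independently of when it is executed.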

          Where is the Planner executed? A DiskBalancer daemon / CLI tool or NN?

The plan is executed by a background thread inside the datanode, just like the DirectoryScanner; the planning itself happens in the CLI tool, as described above.

When copying a replica from one volume to another, how do we prevent concurrency issues with the DirectoryScanner running in the background?

Great question. This is one of the motivations for making the plan executor a thread inside the datanode, so that we don't get into concurrent-access issues with the DirectoryScanner. We take the same locks as the DirectoryScanner when we move the blocks. Also, we don't have any new code for these moves; we rely on the existing mover code path inside the datanode to perform the actual move.
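Purely to illustrate the mutual-exclusion point, here is a tiny hypothetical sketch in which a scanner-like thread and the block mover serialize on the same dataset lock; the names are made up for this sketch and this is not the actual FsDatasetImpl locking code.

// Hypothetical: both paths synchronize on the same dataset object, so a
// replica is never being scanned and moved at the same time.
public class DatasetLockSketch {
  private final Object datasetLock = new Object();

  void scanDirectories() {
    synchronized (datasetLock) {
      // reconcile on-disk replicas with in-memory state (the DirectoryScanner's job)
    }
  }

  void moveBlock(String blockId, String sourceVolume, String destVolume) {
    synchronized (datasetLock) {
      // copy the replica to the destination volume, then update the volume map
    }
  }
}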

Is there a limit on how many such disk balancer jobs can run in the cluster?

Yes, one per datanode. If the submit RPC is called while a job is executing, we reject the new submission. Please look at HDFS-9588, specifically the ClientDatanodeProtocol.proto#SubmitDiskBalancerPlanResponseProto#submitResults enum, to see the set of errors we return. One of them is PLAN_ALREADY_IN_PROGRESS, which is the error you would see if you tried to submit another job to a datanode that is already executing one.

Can another job query the status of a running job?

Yes. I will post a patch soon that supports the QueryPlan RPC, which returns the current status of the executing (or last executed) plan.
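As a rough illustration of the submit-then-poll interaction, here is a hedged client-side sketch; the interface and method names below are hypothetical stand-ins, while the real RPCs are the ones being added on the datanode protocol under HDFS-9588 and its follow-ups.

// Hypothetical client loop: submit a plan to one datanode, then poll a query
// RPC until the datanode reports that the plan has finished executing.
interface DiskBalancerRpc {
  void submitPlan(String planJson) throws Exception;   // may be rejected, e.g. PLAN_ALREADY_IN_PROGRESS
  String queryPlanStatus() throws Exception;           // e.g. "RUNNING" or "DONE"
}

public class SubmitAndPoll {
  static void run(DiskBalancerRpc datanode, String planJson) throws Exception {
    datanode.submitPlan(planJson);
    while (!"DONE".equals(datanode.queryPlanStatus())) {
      Thread.sleep(10_000);   // the CLI never blocks the datanode; it only polls
    }
  }
}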

          Lei (Eddy) Xu added a comment -

          Hey, Anu Engineer.

Thanks a lot for working on this. It is a highly requested feature.

I have a few questions regarding the design; it would be much appreciated if you could answer them to help me better understand it.

• Will the disk balancer be a daemon or a CLI tool that waits for the process to finish?
• Where is the Planner executed? A DiskBalancer daemon / CLI tool or NN?
• When copying a replica from one volume to another, how do we prevent concurrency issues with the DirectoryScanner running in the background?
• Is there a limit on how many such disk balancer jobs can run in the cluster? Can another job query the status of a running job?

          Thanks again!

          Anu Engineer added a comment -
• Fixed the whitespace issue.
• Ignored the checkstyle issues since they are all of the this.x = x; form, where x hides a field. They come from getters and setters.
• Test failures don't seem to be related to this patch.
          Tsz Wo Nicholas Sze added a comment -

          I have just created a new branch, HDFS-1312.

          Tsz Wo Nicholas Sze added a comment -

          BTW, the design doc did not mention anything related to Federation. We should think about how to handle Federation even if it is not supported in the first phase.

          Anu Engineer added a comment -

          Yes. Mean and variance are well-known concepts. It is easier for other people to understand the feature and read the metrics/reports. They don't have to look up what "data density" means.

          Thanks, I will update the patch with that change. The current patch in HDFS-9420 contains notions of "data density".

          Anu Engineer added a comment -

          Similar to the Balancer, we need to define a threshold so that the storage is considered balanced if its dfsUsedRatio is within nodeWeightedMean +/- threshold.

          Sorry, this detail is not in the original design document; it escaped me then. It is defined in code and also in the architecture document, on page 5 of Architecture_and_testplan.pdf:

          dfs.disk.balancer.block.tolerance.percent |   5  | Since data nodes are operational we stop copying data if we have reached a good enough threshold
          

          It is not based on nodeWeightedMean, since the data node is operational and we compute the plan for data moves from a static snapshot of disk utilization. Currently we support a simple config knob that allows us to say what a good-enough target is: if we get within 5% of the desired value in terms of data movement, we can consider it balanced. Please do let me know if this addresses the concern.
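          For illustration only, here is a minimal sketch of how such a "good enough" check could be applied. The class and method names are hypothetical; only the property name and its default of 5 come from the setting described above.

          import org.apache.hadoop.conf.Configuration;

          /** Hypothetical helper showing how a "good enough" tolerance could be applied. */
          public class ToleranceCheck {
            // Property name and default taken from the setting described above.
            static final String TOLERANCE_KEY = "dfs.disk.balancer.block.tolerance.percent";

            /** True once the remaining work is within tolerance% of the planned bytes to move. */
            static boolean goodEnough(Configuration conf, long bytesCopied, long bytesPlanned) {
              int tolerance = conf.getInt(TOLERANCE_KEY, 5);
              if (bytesPlanned <= 0) {
                return true; // nothing to move
              }
              long remaining = bytesPlanned - bytesCopied;
              return remaining * 100L <= (long) tolerance * bytesPlanned;
            }
          }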

          DataTransferProtocol.replaceBlock does support moving blocks across storage types within the same node. We only need to modify it slightly for disk balancing (i.e. moving a block within the same storage type on the same node).

          This is something that should have been part of the design document; thanks for bringing it up. The actual move is performed by calling into FsDataSetImpl.java#moveBlockAcrossStorage, which is part of the Mover's logic.
          The high-level logic in DiskBalancer.java is: find blocks on the source volume and move them to the destination volume by making calls to a slightly modified version of moveBlockAcrossStorage.
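          To make that flow a little more concrete, here is a rough, illustrative sketch of such a copy loop. Every name below is a hypothetical stand-in (the real code lives in DiskBalancer.java and the mover path mentioned above); this is not the actual implementation.

          import java.io.IOException;
          import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
          import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeSpi;

          /** Illustrative-only sketch of the per-step copy loop; not the real DiskBalancer code. */
          abstract class MoveStepSketch {
            /** Pick the next finalized replica on the source volume, or null when none is left. */
            abstract ExtendedBlock nextFinalizedBlockOn(FsVolumeSpi source) throws IOException;

            /** Stand-in for the (slightly modified) moveBlockAcrossStorage call described above. */
            abstract void copyBlockToVolume(ExtendedBlock block, FsVolumeSpi dest) throws IOException;

            /** Set when the administrator cancels the running plan. */
            abstract boolean isCancelled();

            /** Keep copying replicas until the planned byte count is reached or the step is cancelled. */
            void executeStep(FsVolumeSpi source, FsVolumeSpi dest, long bytesToMove) throws IOException {
              long bytesMoved = 0;
              while (bytesMoved < bytesToMove && !isCancelled()) {
                ExtendedBlock block = nextFinalizedBlockOn(source);
                if (block == null) {
                  break; // the source volume has nothing left that can be moved
                }
                copyBlockToVolume(block, dest);
                bytesMoved += block.getNumBytes();
              }
            }
          }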

          Tsz Wo Nicholas Sze added a comment -

          > ... Do you think redefining in terms of nodeWeightedVariance is more precise than the earlier method? ...

          Yes. Mean and variance are well-known concepts. It is easier for other people to understand the feature and read the metrics/reports. They don't have to look up what "data density" means.

          Tsz Wo Nicholas Sze added a comment -

          Some more comments:

          • Similar to the Balancer, we need to define a threshold so that the storage is considered balanced if its dfsUsedRatio is within nodeWeightedMean +/- threshold.
          • DataTransferProtocol.replaceBlock does support moving blocks across storage types within the same node. We only need to modify it slightly for disk balancing (i.e. moving a block within the same storage type on the same node).
          Anu Engineer added a comment -

          Tsz Wo Nicholas Sze You are absolutely right, we could redefine these metrics using standard statistical terms. I will update the document. However, I do have some code which uses the

          sum(abs(nodeDataDensity(i)))
          

          as defined in section 4.1.2. Do you think redefining in terms of nodeWeightedVariance is more precise than the earlier method? This is just for my understanding of the issues with the earlier algorithm.

          Tsz Wo Nicholas Sze added a comment -

          Note also that the calculation of nodeWeightedVariance can be simplified as

          nodeWeightedVariance = sum(w_i * ratio_i^2) - nodeWeightedMean^2.
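          For completeness, this identity follows directly from the definitions of nodeWeightedMean and nodeWeightedVariance in the earlier comment (shown below), using sum(w_i) = 1 and sum(w_i * ratio_i) = nodeWeightedMean:

          nodeWeightedVariance = sum(w_i * (ratio_i - nodeWeightedMean)^2)
                               = sum(w_i * ratio_i^2) - 2 * nodeWeightedMean * sum(w_i * ratio_i) + nodeWeightedMean^2 * sum(w_i)
                               = sum(w_i * ratio_i^2) - nodeWeightedMean^2.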
          
          Tsz Wo Nicholas Sze added a comment -

          Hi Anu, the design doc looks good in general.

          I think we don't need to define volumeDataDensity and nodeDataDensity in Section 4.1. We may simply formulate the calculation using weighted mean and weighted variance.

          • dfsUsedRatio_i for storage i is defined the same as before, i.e.
            dfsUsedRatio_i = dfsUsed_i/capacity_i.
            
          • Define normalized weight using capacity as
            w_i = capacity_i / sum(capacity_i).
            
          • Then, define
                nodeWeightedMean = sum(w_i * dfsUsedRatio_i), and
            nodeWeightedVariance = sum(w_i * (ratio_i - nodeWeightedMean)^2).
            

            We use nodeWeightedVariance (instead of nodeDataDensity) to do comparison. Note that nodeWeightedMean is the same as idealStorage.
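          As a quick sanity check of these definitions, a small self-contained computation (plain Java with made-up numbers, not HDFS code) might look like this:

          /** Toy example of the weighted mean / weighted variance definitions above. */
          public class WeightedStats {
            public static void main(String[] args) {
              long[] capacity = {4000L, 4000L, 1000L}; // e.g. two 4TB disks and one 1TB disk, in GB
              long[] dfsUsed  = {3600L,  400L,  500L}; // unevenly used

              double totalCapacity = 0;
              for (long c : capacity) {
                totalCapacity += c;
              }

              double mean = 0;
              for (int i = 0; i < capacity.length; i++) {
                double w = capacity[i] / totalCapacity;          // w_i = capacity_i / sum(capacity_i)
                mean += w * ((double) dfsUsed[i] / capacity[i]); // sum(w_i * dfsUsedRatio_i)
              }

              double variance = 0;
              for (int i = 0; i < capacity.length; i++) {
                double w = capacity[i] / totalCapacity;
                double ratio = (double) dfsUsed[i] / capacity[i];
                variance += w * (ratio - mean) * (ratio - mean); // sum(w_i * (ratio_i - mean)^2)
              }

              System.out.printf("nodeWeightedMean = %.3f, nodeWeightedVariance = %.4f%n", mean, variance);
            }
          }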

          Anu Engineer added a comment -

          This document discusses the disk balancer implementation details and the test plan. I will file a set of follow-up JIRAs for the disk balancer.

          Andrew Wang added a comment -

          We discussed changing the default policy in HDFS-8538; Arpit wanted a failsafe to prevent overloading a disk with too many writes before we changed the default. Linking it here.

          Aaron T. Myers added a comment -

          There are a large number of clusters which are using round-robin scheduling. I have been looking around for data on HDFS-1804 (in fact the proposal discusses that issue); it would be good if you have some data on HDFS-1804 deployment. Most of the clusters that I am seeing (anecdotal, I know) are based on round-robin scheduling. Please also see the thread I refer to in the proposal on LinkedIn, and you will see customers are also looking for this data.

          Presumably that's mostly because the round-robin volume choosing policy is the default, and many users don't even know that there's an alternative. Independently of the need to implement an active balancer as this JIRA proposes, should we consider changing the default to the available space volume choosing policy? We'd probably need to do this in Hadoop 3.0, since I'd think this should be considered an incompatible change.
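          For reference, opting into the space-aware policy today is just a configuration change. A rough sketch follows; the property and class names below are from memory and should be double-checked against the documentation for your Hadoop version (it is normally set in hdfs-site.xml rather than in code).

          import org.apache.hadoop.conf.Configuration;

          public class VolumePolicyConfig {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // Believed to be the knob that selects the intra-datanode volume choosing policy.
              conf.set("dfs.datanode.fsdataset.volume.choosing.policy",
                  "org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy");
              System.out.println(conf.get("dfs.datanode.fsdataset.volume.choosing.policy"));
            }
          }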

          Anu Engineer added a comment -

          Daniel Templeton Thanks for your comments, Please see my thoughts on your comments.

          Given HDFS-1804, I think Steve Loughran's original proposal of a balancer script that can be run manually while the DN is offline sounds like a simpler/safer/better approach. Since the primary remaining source of imbalance is disk failure, an offline process seems sensible. What's the main motivation for building an online-balancer?

          1. We supported hot-swapping in HDFS-1362, so this feature complements that. Off-line was a good idea when Steve Loughran proposed it, but I am not sure it helps currently.
          2. There are a large number of clusters which are using round-robin scheduling. I have been looking around for data on HDFS-1804 (in fact the proposal discusses that issue); it would be good if you have some data on HDFS-1804 deployment. Most of the clusters that I am seeing (anecdotal, I know) are based on round-robin scheduling. Please also see the thread I refer to in the proposal on LinkedIn, and you will see customers are also looking for this data.

          This has been a painful issue that multiple customers have complained about, and addressing it would improve the usability of HDFS. Please look at section 2 of the proposal to see the large set of issues that customers have been complaining about, and how hard some of the workarounds are.

          The reporting aspect should perhaps be handled under HDFS-1121.

          In order to have a tool that can do disk balancing, the user usually asks a simple question: "Which machines in my cluster need rebalancing?" There is an I/O cost involved in running any disk balancing operation, and many times you need to do this proactively, since there are times when disks go out of balance (see some comments earlier in this JIRA itself) even without a failure. This proposal says that we will create a metric that describes the ideal data distribution and how far a given node is from that ideal, which will allow us to compute the set of nodes that will benefit from disk balancing and what the actual balancing should look like (that is, what data movements we are planning). I don't think HDFS-1121 is talking about this requirement. I do think being able to answer "which machines need disk balancing" makes for a good operational interface for HDFS.

          Daniel Templeton added a comment -

          The reporting aspect should perhaps be handled under HDFS-1121.

          Given HDFS-1804, I think Steve Loughran's original proposal of a balancer script that can be run manually while the DN is offline sounds like a simpler/safer/better approach. Since the primary remaining source of imbalance is disk failure, an offline process seems sensible. What's the main motivation for building an online-balancer?

          Anu Engineer added a comment -

          David Kaiser Thanks for your comment. Without going into too much detail, the answer is yes. DiskBalancer is a simple data layout program that allows the end user to control how much data resides on each volume.

          David Kaiser added a comment -

          Anu, I would like to add a comment to the tool proposal. Could an administrator specify that a particular volume (or list of volumes) end up with exactly zero blocks, effectively emptying that volume of block storage, say to support the task of "decommissioning a drive"?

          The intent of this would be that all blocks from the specified volume would be moved, as proposed by the DiskBalancer logic, to the other volumes, so that the then-empty volume could be removed from the dfs.datanode.dir path mapping.

          Anu Engineer added a comment -

          I would like to propose a tool similar to the balancer that addresses this issue. The details are in the attached document.

          Dave Marion added a comment -

          Note that as of 2.6.0 the layout has changed on the DataNode. Please see [1] for more information. I don't know if the tools mentioned above will work with these new restrictions.

          [1] https://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F

          Benoit Perroud added a comment -

          Here is a rebalancer we're using internally on small clusters: https://github.com/killerwhile/volume-balancer

          But from our experience, it's easier (and sometimes faster) to wipe all the drives and run a balancer.

          Kevin Lyda added a comment -

          I wrote the following in Java, which... is not really my favourite language. However, Hadoop is written in Java, and if the code is ever to make it into Hadoop proper, that seems like an important requirement.

          The particular cluster I need this working on is a 1.0.3 cluster, so my code is written for that (hence ant, etc.). It assumes a built 1.0.3 tree lives in ../hadoop-common. Again, I'm not a Java person, so I know this isn't the greatest setup.

          The code is here:

          https://bitbucket.org/lyda/intranode-balance

          I've run it on a test cluster but haven't tried it on "real" data yet. In the meantime pull requests are accepted / desired / encouraged / etc.

          Michael Schmitz added a comment -

          I wrote this code to balance blocks on a single data node. I'm not sure if it's the right approach at all--particularly with how I deal with subdirectories. I wish there were better documentation on how to manually balance a disk.

          https://github.com/schmmd/hadoop-balancer

          My biggest problem with my code is that it's way too slow and it requires the cluster to be down.

          Kevin Lyda added a comment -

          Are there any decent code examples for how one might do this? I've never done Hadoop dev, so some pointers on where to start would be appreciated.

          Steve Hoffman added a comment -

          Given the general nature of HDFS and its many uses (HBase, M/R, etc.), as much as I'd like it to "just work", it is clear that it always depends on the use case. Maybe one day we won't need a balancer script for disks (or for the cluster).

          I'm totally OK with having a machine-level balancer script. We use the HDFS balancer to fix inter-machine imbalances when they crop up (again, for a variety of reasons). It makes sense to have a manual script for intra-machine imbalances for people who DO have issues and make it part of the standard install (like the HDFS balancer).

          Steve Loughran added a comment -

          @Kevin: with the loss of a single disk, not only is the rest of the data on the server preserved, the server keeps going. You get 1-3TB of network traffic as the under-replicated data is re-replicated, but that's all.

          Kevin Lyda added a comment -

          As an alternative, this issue could be avoided if clusters were configured with LVM and data striping. Well, not avoided, but the problem would be pushed down a layer. However, tuning docs discourage using LVM.

          Besides performance, another obvious downside to putting a single fs across all disks is that a single disk failure would destroy the entire node as opposed to a portion of it. However, I'm new to Hadoop, so I'm not sure how a partial data loss event is handled. If a data node w/ a partial data loss can be restored w/o copying in the data it has not lost, then avoiding striped LVM is a huge win, particularly for the XX TB case mentioned above, where balancing in a new node from scratch would involve a huge data transfer with time and network consequences.

          Kevin Lyda added a comment -

          Continuing on Eli's comment, modifying the placement policy also fails to handle deletions.

          I'm currently experiencing this on my cluster, where the first datadir is both smaller and getting more of the data (for reasons I'm still trying to figure out - it might be due to how the machines were configured historically). The offline rebalance script sounds like a good first step.

          Eli Collins added a comment -

          There are issues with just modifying the placement policy:

          1. It only solves the problem for new blocks. If you add a bunch of new disks you want to rebalance the cluster immediately to get better read throughput. And if you have to implement block balancing (e.g. on startup) then you don't need to modify the placement policy.
          2. The internal policy intentionally ignores disk usage to optimize for performance (round-robining blocks across all spindles). As you point out, in some cases this won't be much of a hit, but on a 12-disk machine where half the disks are new the impact will be noticeable.
          3. Now that placement policies are pluggable there are multiple of them, and this requires every policy to solve this problem vs. solving it just once.

          IMO a background process would actually be easier than modifying the placement policy. Just balancing on DN startup is simplest and would solve most people's issues, though it would require a rolling DN restart if you wanted to do it on-line.

          Scott Carey added a comment -

          Isn't the datanode internal block placement policy an easier/simpler solution?

          IMO if you simply placed blocks on disks based on the weight of free space available then this would not be a big issue. You would always run out of space with all drives near the same fill level. The drawback would be write performance bottlenecks in more extreme cases.

          If you were 90% full on 11 drives and 100% empty on one, then ~50% of new blocks would go to the new drive (however, few reads would hit this drive). That is not ideal for performance, but not a big problem either, since it should rapidly become more balanced.
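          To make that example concrete (assuming identical drive capacities): with free-space weighting, 11 drives at 90% full each have 10% of their capacity free, so the share of new blocks landing on the empty drive is roughly

          1.0 / (1.0 + 11 * 0.1) = 1.0 / 2.1 ≈ 48%

          which is consistent with the ~50% figure above.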

          In most situations, we would be talking about systems that have 3 to 11 drives that are 50% to 70% full and one empty drive. This would lead to between ~17% and 55% of writes going to that drive, instead of the 8% or 25% that would happen with round-robin.

          IMO the default datanode block placement should be weighted towards disks with more free space. There are other cases besides disk failure that can lead to imbalanced space usage, including heterogeneous partition sizes. That would mitigate the need for any complicated background rebalance tasks.

          Perhaps on start-up a datanode could optionally do some local rebalancing before joining the cluster.
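          A minimal sketch of what such free-space-weighted placement could look like (purely illustrative; this is not the pluggable volume-choosing interface HDFS actually uses):

          import java.util.List;
          import java.util.Random;

          /** Illustrative only: pick a volume with probability proportional to its free space. */
          class FreeSpaceWeightedChooser {
            private final Random random = new Random();

            /** volumesFree holds the free bytes of each volume; returns the index of the chosen volume. */
            int chooseVolume(List<Long> volumesFree) {
              long totalFree = 0;
              for (long free : volumesFree) {
                totalFree += free;
              }
              long pick = (long) (random.nextDouble() * totalFree);
              long running = 0;
              for (int i = 0; i < volumesFree.size(); i++) {
                running += volumesFree.get(i);
                if (pick < running) {
                  return i; // emptier volumes cover a larger slice of [0, totalFree)
                }
              }
              return volumesFree.size() - 1; // fallback for rounding at the boundary
            }
          }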

          Steve Loughran added a comment -

          Let's start with a Python script that people can use while the DN is offline, as that would work against all versions. Moving blocks as a background task would be much better, but I can imagine surprises (need to make sure a block isn't in use locally or remotely, handling of very unbalanced disks where disk #4 is a 50TB NFS mount, etc.).

          Eli Collins added a comment -

          This one is on my radar, I just haven't had time to get to it. The most common use cases for this I've seen are an admin adding new drives, or drives having mixed capacities so the smaller drives fill faster. For this occasional imbalance, a DN startup option that rebalances the disks and shuts down is probably sufficient. OTOH it would be pretty easy to have a background task that moves finalized blocks from heavy disks to lightly-loaded disks.

          Steve Loughran added a comment -

          I don't think it's a won't-fix, just that nobody has sat down to fix it. With the trend towards very storage-dense servers (12-16x 3TB), this problem has grown, and it does need fixing.

          What does it take?

          1. tests
          2. solution
          3. patch submit and review.

          It may be possible to implement the rebalance operation as some script (Please, not bash, something better like Python); ops could run this script after installing a new HDD.

          A way to assess HDD imbalance in a DN or across the cluster would be good too; that could pick up problems which manual/automated intervention could fix.

          BTW, not everyone splits MR and DFS onto separate disk partitions, as that removes flexibility and the ability to handle MR jobs with large intermediate output.

          Steve Hoffman added a comment -

          Yes, I think this should be fixed.

          This was my original question really. Since it hasn't made the cut in over 2 years, I was wondering what it would take to either do something with this, or whether it should be closed as a "won't fix" with script/documentation support for the admins.

          No, I don't think this is as big of an issue as most people think.

          Basically, I agree with you. There are worse things that can go wrong.

          At 70-80% full, you start to run the risk that the NN is going to have trouble placing blocks, esp if . Also, if you are like most places and put the MR spill space on the same file system as HDFS, that 70-80% is more like 100%, especially if you don't clean up after MR. (Thus why I always put MR area on a separate file system...)

          Agreed. More getting installed Friday. Just don't want bad timing/luck to be a factor here – and we do clean up after the MR.

          As you scale, you care less about the health of individual nodes and more about total framework health.

          Sorry, have to disagree here. The total framework is made up of the parts. While I agree there is enough redundancy built in to handle most cases once your node count gets above a certain level, you are basically saying it doesn't have to work well in all cases because more $ can be thrown at it.

          1PB isn't that big. At 12 drives per node, we're looking at ~50-60 nodes.

          Our cluster is storage dense, yes, so the loss of 1 node is noticeable.

          Allen Wittenauer added a comment -

          Since someone (off-jira) asked:

          • Yes, I think this should be fixed.
          • No, I don't think this is as big of an issue as most people think.
          • At 70-80% full, you start to run the risk that the NN is going to have trouble placing blocks, esp if . Also, if you are like most places and put the MR spill space on the same file system as HDFS, that 70-80% is more like 100%, especially if you don't clean up after MR. (Thus why I always put MR area on a separate file system...)
          • As you scale, you care less about the health of individual nodes and more about total framework health.
          • 1PB isn't that big. At 12 drives per node, we're looking at ~50-60 nodes.
          Allen Wittenauer added a comment -

          We typically run at 70-80% full as we are not made of money.

          No, you aren't made of money, but you are under-provisioned.

          Steve Hoffman added a comment -

          The other thing is that as you grow a grid, you care less and less about the balance on individual nodes. This issue is of primary importance to smaller installations, which are likely under-provisioned hardware-wise anyway.

          Our installation is about 1PB so I think we can say we are past "small". We typically run at 70-80% full as we are not made of money. And at 90% the disk alarms start waking people out of bed.
          I would say we very much care about the balance of a single node. When that node fills, it'll take out the region server and the M/R jobs running on it, and generally anger people whose jobs have to be restarted.

          I wouldn't be so quick to discount this. And when you have enough machines, you are replacing disks more and more frequently. So ANY manual process is $ wasted in people's time. Time to re-run jobs, time to take down a datanode and move blocks. Time = $. To turn Hadoop into a more mature product, shouldn't we be striving for "it just works"?

          Andrew Purtell added a comment -

          It might be worth adding to the manual a note that, after adding or replacing drives on a DataNode (when it's temporarily offline anyway), blocks and their associated metadata files can both be moved to any defined data directory for local rebalancing?

          Allen Wittenauer added a comment -

          The only way we have found to move blocks internally (without taking the cluster down completely) is to decommission the node, wait for it to empty, and then re-add it to the cluster so the balancer can take over and move blocks back onto it.

          That's the slow way. You can also take down the data node, move the blocks around on the disks, and restart the data node.

          The other thing is that as you grow a grid, you care less and less about the balance on individual nodes. This issue is of primary importance to smaller installations, which are likely under-provisioned hardware-wise anyway.

          Steve Hoffman added a comment -

          Wow, I can't believe this is still lingering out there as a new feature request. I'd argue this is a bug – and a big one. Here's why:

          • You have Nx 12x3TB machines in your cluster.
          • 1 disk fails on a 12-drive machine. Let's say they each were 80% full.
          • You install the replacement drive (0% full), but by the time you do this the under-replicated blocks have been fixed (on this and other nodes).
          • The 0% full drive will fill at the same rate as the other disks. That machine's other 11 disks will fill to 100%, as block placement is at the node level and the node uses a round-robin algorithm even though one disk has far more free space.

          The only way we have found to move blocks internally (without taking the cluster down completely) is to decommission the node, wait for it to empty, and then re-add it to the cluster so the balancer can take over and move blocks back onto it.

          Hard drives fail. This isn't news to anybody. The larger (12-disk) nodes only make the problem worse in terms of the time to empty and fill again. Even if you had a 1U 4-disk machine it is still bad, because you lose 25% of that node's capacity on 1 disk failure, whereas the impact on a 12-disk machine is less than 9%.

          The remove/add of a complete node seems like a pretty poor option.
          Or am I alone in this? Can we please revive this JIRA?

          Steve Loughran added a comment -

          Another driver for rebalancing is a cluster set up with mapred temp space allocated to the same partitions as HDFS. This delivers the best IO rate for temp storage, but once that temp space is reclaimed, the disks can get unbalanced.

          Sanjay Radia added a comment -

          >> From my point of view, there are mainly two cases for re-balancing:
          >There's a third, and one I suspect is more common than people realize: mass deletes.
          Since the creates are distributed, there is a high probability that the deletes will also be distributed.

          Wang Xu added a comment -

          Hi Allen,

          For the JSP that shows the local filesystem, I will comment on HDFS-1121 with the detailed requirements.

          And to simplify the design, I agree that we could move only finalized blocks and not touch existing dirs.

          Allen Wittenauer added a comment -

          >For the cluster monitoring issue, we cannot expect an integrated monitoring
          > system to substitute for the external monitoring system. Thus, we need to make clear
          > what information-gathering requirements should be considered inside HDFS.

          There is nothing preventing us from building a jsp that runs on the datanode that shows the file systems in use and the relevant stats for those fs's.

          >And the "lock" I mentions is to stop write new blocks into the volume,
          > which is the simplest way to migrate blocks.

          This is going to be a big performance hit. I think the focus should be on already written/closed blocks and just ignore new blocks.

          Wang Xu added a comment -

          Hi Allen,

          For the cluster monitoring issue, we cannot expect an integrated monitoring system to substitute for the external monitoring system. Thus, we need to make clear what information-gathering requirements should be considered inside HDFS.

          And the "lock" I mention is to stop writing new blocks into the volume, which is the simplest way to migrate blocks. And I do think the depth matters: from my understanding, the deeper the dir is, the more IO is wasted reaching the block.

          Allen Wittenauer added a comment -

          > IMHO, you can monitor the file distribution among disks with external tools
          > such as ganglia, is it required to integrate it in the Web interface of HDFS?

          Yes. First off, Ganglia sucks at large scales for too many reasons to go into here. Secondly, only Hadoop really knows what file systems are in play for HDFS.

          > From my point of view, there are mainly two cases for re-balancing:

          There's a third, and one I suspect is more common than people realize: mass deletes.

          > Re-balancing should only proceed while the node is not under heavy load
          > (should this be guaranteed by the administrator?)

          and

          > Lock the origin disks: stop writing to them and wait for finalization on them.

          I don't think it is realistic to expect the system to be idle during a rebalance. Ultimately, it shouldn't matter from the rebalancer's perspective anyway; the only performance hit that should be noticeable would be for blocks in the middle of being moved. Even then, the DN process knows what blocks are where and can read from the 'old' location.

          If DNs being idle is a requirement, then one is better off just shutting down the DN (and TT?) processes and doing the rebalance offline.

          > Find the deepest dirs in every selected disk and move blocks from those dirs.
          > And if a dir is empty, then the dir should also be removed.

          Why does depth matter?

          > If two or more dirs are located on the same disk, they might confuse the
          > space calculation. And this is precisely the case in a MiniDFSCluster deployment.

          This is also the case with pooled storage systems (such as ZFS) on real clusters already.

          Wang Xu added a comment -

          Hi folks,

          Here is the basic design of the process. Are there any other considerations?

          The basic flow is (a rough code sketch follows after the corner cases below):

          1. Re-balancing should only run while the node is not under heavy load (should this be guaranteed by the administrator?)
          2. Calculate the total and average available & used space of the storage dirs.
          3. Find the disks with the most and least free space, and decide the move direction. We need to define an imbalance threshold here to decide whether re-balancing is worthwhile.
          4. Lock the origin disks: stop writing to them and wait for in-flight blocks to be finalized.
          5. Find the deepest dirs in every selected disk and move blocks out of those dirs. If a dir becomes empty, it should also be removed.
          6. Check the balance status while blocks are migrated, and break out of the loop once the threshold is reached.
          7. Release the lock.

          Cases that should be taken into account:

          • If a disk has much less total space than the other disks, it might have the least available space, yet it may not be able to migrate blocks out.
          • If two or more dirs are located on the same disk, they might confuse the space calculation; this is exactly the case in a MiniDFSCluster deployment.
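
          A rough code sketch of the flow above, under simplifying assumptions: the Volume interface and its methods are placeholders, and heavy-load detection, locking, and directory cleanup (steps 1, 4, 5 and 7) are left out.

          // Hypothetical sketch of the per-DataNode rebalance loop described above.
          import java.util.Comparator;
          import java.util.List;

          class IntraNodeRebalancer {
            interface Volume {
              long capacity();                        // total bytes behind this storage dir
              long used();                            // bytes currently used
              boolean moveOneBlockTo(Volume target);  // move a single block; true if moved
            }

            private static double usedPercent(Volume v) {
              return v.capacity() == 0 ? 0.0 : (double) v.used() / v.capacity();
            }

            /** Move blocks from the fullest to the emptiest volume until the spread of
             *  used% falls below the threshold or nothing more can be moved (step 6). */
            static void rebalance(List<Volume> volumes, double thresholdPercent) {
              while (true) {
                Volume source = volumes.stream()
                    .max(Comparator.comparingDouble(IntraNodeRebalancer::usedPercent)).orElseThrow();
                Volume target = volumes.stream()
                    .min(Comparator.comparingDouble(IntraNodeRebalancer::usedPercent)).orElseThrow();
                double spread = (usedPercent(source) - usedPercent(target)) * 100.0;
                if (spread <= thresholdPercent) {
                  break;  // balanced enough
                }
                if (!source.moveOneBlockTo(target)) {
                  break;  // nothing movable, e.g. the small-disk corner case above
                }
              }
            }
          }
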
          Wang Xu added a comment -

          Hi Steve,

          From my point of view, there are mainly two cases of re-balancing:

          1. The admin replaces a disk and then triggers the re-balance immediately or at a specific time.
          2. The monitoring system observes a disk-space distribution issue and raises a warning; the admin then triggers the re-balance immediately or at a specific time.

          Do you mean we should integrate a monitoring system inside HDFS?

          Steve Loughran added a comment -

          I think having a remote web view is useful in two ways:
          - lets people see the basics of what is going on within the entire cluster (yes, that will need some aggregation eventually)
          - lets you write tests that hit the status pages and so verify that the rebalancing worked (a rough sketch of such a test follows below).
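
          Purely as an illustration of that testing idea — the /volumeStatus URL and the usedPercent field below are invented for this sketch, not an existing DataNode endpoint — such a test could scrape a per-volume status page and assert that the node is balanced:

          // Hypothetical sketch: fetch a per-volume usage page from a datanode and
          // check that used% across volumes stays within a threshold.
          import java.io.IOException;
          import java.net.URI;
          import java.net.http.HttpClient;
          import java.net.http.HttpRequest;
          import java.net.http.HttpResponse;
          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.List;
          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          class RebalanceStatusCheck {
            static void assertBalanced(String dnHttpAddress, double maxSpreadPercent)
                throws IOException, InterruptedException {
              HttpClient client = HttpClient.newHttpClient();
              HttpResponse<String> resp = client.send(
                  HttpRequest.newBuilder(URI.create("http://" + dnHttpAddress + "/volumeStatus")).build(),
                  HttpResponse.BodyHandlers.ofString());

              // Naive parsing: pull every "usedPercent": <number> field out of the body.
              Matcher m = Pattern.compile("\"usedPercent\"\\s*:\\s*([0-9.]+)").matcher(resp.body());
              List<Double> used = new ArrayList<>();
              while (m.find()) {
                used.add(Double.parseDouble(m.group(1)));
              }
              if (used.isEmpty()) {
                throw new AssertionError("No per-volume usage data found");
              }
              double spread = Collections.max(used) - Collections.min(used);
              if (spread > maxSpreadPercent) {
                throw new AssertionError("Volumes not balanced: spread=" + spread + "%");
              }
            }
          }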

          Wang Xu added a comment -

          And @Steve, I did not intend to extend HDFS-1362 to cover this issue; I only meant that they are related.

          Wang Xu added a comment -

          Hi Steve,

          I do not understand the necessity of HDFS-1121. IMHO, you can monitor the file distribution among disks with external tools such as Ganglia; is it required to integrate this into the web interface of HDFS?

          I think the regular routine is to find the problem in the cluster-management system and then trigger the rebalance action in HDFS.

          Steve Loughran added a comment -

          HDFS-1362 and this issue are part of the HDFS-664 problem "support efficient hotswap".
          Before worrying about this one, consider HDFS-1121, which is to provide a way to monitor the distribution (i.e. a web view). That web/management view would be how we'd test that the rebalancing works, so it's a prerequisite. Also it's best to keep the issues independent (where possible), so worry about getting HDFS-1362 in first before trying to extend it.

          That said, because the number of HDDs per server is growing to 12 or more at 2 TB each, with 3 TB drives on the horizon, we will need this feature in the 0.23-0.24 timeframe.

          Wang Xu added a comment -

          I think re-balancing is just the next step after substituting a failed disk, as in HDFS-1362. And since I've just finished the patch for HDFS-1362, I would like to implement this feature if no one has done it yet.

          Eli Collins added a comment -

          Agreed, re-opening. Let's have this issue track re-balancing disks within a DN.

          Steve Loughran added a comment -

          HDFS-1120 is part of the solution, but some rebalancing utility is needed too. Maybe this issue should be re-opened as the underlying problem, with HDFS-1120 and HDFS-1121 as parts of the fix, along with "add a way to rebalance HDDs on a single DN".

          Travis Crawford added a comment -

          HDFS-1120 looks like a better place to track this issue. Marking this as a duplicate.

          Scott Carey added a comment -

          A roulette-like algorithm for choosing which disk to write a block to should be relatively easy and would keep the disks generally balanced, except when major changes such as adding a new disk occur.

          I think the safest default option is to make the relative odds of choosing a disk proportional to the free space available. The disk-placement code already has access to that information and checks it to see whether the disk is full. Changing it from round-robin to weighted is straightforward unless this needs to be pluggable.
          A larger disk will get more writes, but there is no avoiding that if you want the disks to stay balanced. Balancing after placement causes more overall I/O than placing the block correctly the first time.

          Rebalancing will always be needed from time to time for other reasons, however.
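
          A minimal sketch of the weighted ("roulette") choice described above, assuming a simplified Volume interface rather than HDFS's actual volume-choosing plugin API; getAvailable() here is a placeholder.

          // Hypothetical sketch: pick a volume with probability proportional to its
          // free space (roulette-wheel selection) instead of plain round-robin.
          import java.util.List;
          import java.util.Random;

          class AvailableSpaceWeightedChooser {
            interface Volume {
              long getAvailable();  // free bytes on this volume (placeholder method)
            }

            private final Random random = new Random();

            Volume choose(List<Volume> volumes, long blockSize) {
              long totalFree = 0;
              for (Volume v : volumes) {
                if (v.getAvailable() >= blockSize) {
                  totalFree += v.getAvailable();
                }
              }
              if (totalFree == 0) {
                throw new IllegalStateException("No volume has room for " + blockSize + " bytes");
              }
              // Spin the wheel: each eligible volume's chance equals its share of free space.
              long spin = (long) (random.nextDouble() * totalFree);
              for (Volume v : volumes) {
                if (v.getAvailable() < blockSize) {
                  continue;
                }
                spin -= v.getAvailable();
                if (spin < 0) {
                  return v;
                }
              }
              return volumes.get(volumes.size() - 1);  // defensive fallback; not reached
            }
          }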

          Allen Wittenauer added a comment -

          Aha! Found it.

          One of these should get closed as a duplicate of the other, though.

          dhruba borthakur added a comment -

          Rebalancing blocks locally on each individual datanode sounds like a good idea. We have seen this problem in our cluster too.


            People

            • Assignee: Anu Engineer
            • Reporter: Travis Crawford
            • Votes: 27
            • Watchers: 104
