Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-11742

Improve balancer usability after HDFS-8818

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 2.7.4, 3.0.0-beta1, 2.8.2
    • Component/s: None
    • Labels:
      None

      Description

      We ran 2.8 balancer with HDFS-8818 on a 280-node and a 2,400-node cluster. In both cases, it would hang forever after two iterations. The two iterations were also moving things at a significantly lower rate. The hang itself is fixed by HDFS-11377, but the design limitation remains, so the balancer throughput ends up actually lower.

      Instead of reverting HDFS-8188 as originally suggested, I am making a small change to make it less error prone and more usable.

      1. balancer_fix.png
        28 kB
        Kihwal Lee
      2. balancer2.8.png
        23 kB
        Kihwal Lee
      3. HDFS-11742.branch-2.8.patch
        5 kB
        Kihwal Lee
      4. HDFS-11742.branch-2.patch
        5 kB
        Kihwal Lee
      5. HDFS-11742.trunk.patch
        5 kB
        Kihwal Lee
      6. HDFS-11742.v2.trunk.patch
        4 kB
        Kihwal Lee
      7. replaceBlockNumOps-8w.jpg
        89 kB
        Konstantin Shvachko

        Issue Links

          Activity

          Hide
          szetszwo Tsz Wo Nicholas Sze added a comment -

          Hi Kihwal Lee, let's have a more detailed discussion before reverting HDFS-8818.

          HDFS-8818 (a thread pool per datanode pair) is indeed an improvement for the previous design (a global thread pool) since it limits the number of threads assigned to a particular datanode pair. Previously, if the first datanode pair has a lot of pending moves, all the threads will be used to execute the moves for that pair so that it will be very slow since it cannot utilize the entire network.

          We also has tested the new code a lot and see significant performance improvement.

          Have you tested it with HDFS-11377?

          Show
          szetszwo Tsz Wo Nicholas Sze added a comment - Hi Kihwal Lee, let's have a more detailed discussion before reverting HDFS-8818 . HDFS-8818 (a thread pool per datanode pair) is indeed an improvement for the previous design (a global thread pool) since it limits the number of threads assigned to a particular datanode pair. Previously, if the first datanode pair has a lot of pending moves, all the threads will be used to execute the moves for that pair so that it will be very slow since it cannot utilize the entire network. We also has tested the new code a lot and see significant performance improvement. Have you tested it with HDFS-11377 ?
          Hide
          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... The reason is explained in the jira comments. ...

          I think you mean the following comment:

          ..., but the design is still flawed. Move decision is made then thread pools are created, which likely only cover a subset of targets in the schedule. Also, the schedule will less likely make use of all 50 threads anyway. In any case, the moves will all pile up to the first 25 targets (in our case, 25 thread pools seemed to be the limit). I don't think it will be faster than the previous one.

          If 50 is not a good value in your case, you may change it by setting dfs.datanode.balance.max.concurrent.moves.

          Show
          szetszwo Tsz Wo Nicholas Sze added a comment - > ... The reason is explained in the jira comments. ... I think you mean the following comment: ..., but the design is still flawed. Move decision is made then thread pools are created, which likely only cover a subset of targets in the schedule. Also, the schedule will less likely make use of all 50 threads anyway. In any case, the moves will all pile up to the first 25 targets (in our case, 25 thread pools seemed to be the limit). I don't think it will be faster than the previous one. If 50 is not a good value in your case, you may change it by setting dfs.datanode.balance.max.concurrent.moves.
          Hide
          szetszwo Tsz Wo Nicholas Sze added a comment - - edited

          What we need is HDFS-7639. The number of moves should NOT be limited by the number of threads; see this comment.

          Show
          szetszwo Tsz Wo Nicholas Sze added a comment - - edited What we need is HDFS-7639 . The number of moves should NOT be limited by the number of threads; see this comment .
          Hide
          kihwal Kihwal Lee added a comment - - edited

          I ran Balancer with the suggested revert and then with HDFS-11377. I won't bother to post plots for pre-HDFS-11377. It barely registers. As you can see it is still not as good as revert. Sure it might work well for certain cases, but clearly performs poorly on all the clusters we have tried it on. If 2.8.1 is put up for vote with this, I will have to -1 the release.

          you may change it by setting dfs.datanode.balance.max.concurrent.moves.

          It is not feasible to tune it per cluster, even if it improves the situation.

          What we need is HDFS-7639

          I agree that dispatching needs to be asynchronous. But, I don't see HDFS-8818 as a stepping stone or prerequisite. Since we are trying to release 2.8.1, I suggest HDFS-8818 be reverted and the improvement be redesigned.

          Show
          kihwal Kihwal Lee added a comment - - edited I ran Balancer with the suggested revert and then with HDFS-11377 . I won't bother to post plots for pre- HDFS-11377 . It barely registers. As you can see it is still not as good as revert. Sure it might work well for certain cases, but clearly performs poorly on all the clusters we have tried it on. If 2.8.1 is put up for vote with this, I will have to -1 the release. you may change it by setting dfs.datanode.balance.max.concurrent.moves. It is not feasible to tune it per cluster, even if it improves the situation. What we need is HDFS-7639 I agree that dispatching needs to be asynchronous. But, I don't see HDFS-8818 as a stepping stone or prerequisite. Since we are trying to release 2.8.1, I suggest HDFS-8818 be reverted and the improvement be redesigned.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 14m 33s trunk passed
          +1 compile 0m 54s trunk passed
          +1 checkstyle 0m 38s trunk passed
          +1 mvnsite 1m 2s trunk passed
          +1 mvneclipse 0m 14s trunk passed
          -1 findbugs 1m 50s hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings.
          +1 javadoc 0m 48s trunk passed
          +1 mvninstall 1m 3s the patch passed
          +1 compile 1m 1s the patch passed
          +1 javac 1m 1s the patch passed
          -0 checkstyle 0m 35s hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 45 unchanged - 2 fixed = 46 total (was 47)
          +1 mvnsite 0m 55s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 57s the patch passed
          +1 javadoc 0m 42s the patch passed
          -1 unit 70m 32s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 20s The patch does not generate ASF License warnings.
          98m 58s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestEncryptionZones
            hadoop.hdfs.server.namenode.TestStartup
            hadoop.hdfs.server.namenode.TestMetadataVersionOutput



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HDFS-11742
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12866066/HDFS-11742.trunk.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux a61fb8c86ce0 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / cedaf4c
          Default Java 1.8.0_121
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19280/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19280/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19280/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19280/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19280/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 14s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 14m 33s trunk passed +1 compile 0m 54s trunk passed +1 checkstyle 0m 38s trunk passed +1 mvnsite 1m 2s trunk passed +1 mvneclipse 0m 14s trunk passed -1 findbugs 1m 50s hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings. +1 javadoc 0m 48s trunk passed +1 mvninstall 1m 3s the patch passed +1 compile 1m 1s the patch passed +1 javac 1m 1s the patch passed -0 checkstyle 0m 35s hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 45 unchanged - 2 fixed = 46 total (was 47) +1 mvnsite 0m 55s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 57s the patch passed +1 javadoc 0m 42s the patch passed -1 unit 70m 32s hadoop-hdfs in the patch failed. +1 asflicense 0m 20s The patch does not generate ASF License warnings. 98m 58s Reason Tests Failed junit tests hadoop.hdfs.TestEncryptionZones   hadoop.hdfs.server.namenode.TestStartup   hadoop.hdfs.server.namenode.TestMetadataVersionOutput Subsystem Report/Notes Docker Image:yetus/hadoop:14b5c93 JIRA Issue HDFS-11742 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12866066/HDFS-11742.trunk.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux a61fb8c86ce0 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / cedaf4c Default Java 1.8.0_121 findbugs v3.1.0-RC1 findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19280/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19280/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/19280/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19280/testReport/ modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19280/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          kihwal Kihwal Lee added a comment -

          Instead of reverting, I am making a simple change to make it more usable. This will prevent users from hitting the same issues we had. The changes from HDFS-8188 does allow running balancer at a higher throughput, but it needs turning multiple knobs to get there. And when it is running slower than the previous release, users will have no clue why it is so. The default config values may result in degraded performance for users running a cluster with more than 20 nodes.

          The main problem of HDFS-8188 is the way thread pool is created per target. If it reaches the limit (max mover threads), the remaining pending moves are simply dropped (Or even worse, it hangs without HDFS-11377), leading to degraded performance as demonstrated above with graphs. The suggested workaround of "set the mover thread limit to 10,000 or 30,000" simply means removing the limit. i.e. it cannot work with the limit.

          The suggested improvement calculates the size of each mover thread pool, instead of using the configured fixed value. The total thread count limit is honored without causing the degradation seen with the original design.

          Show
          kihwal Kihwal Lee added a comment - Instead of reverting, I am making a simple change to make it more usable. This will prevent users from hitting the same issues we had. The changes from HDFS-8188 does allow running balancer at a higher throughput, but it needs turning multiple knobs to get there. And when it is running slower than the previous release, users will have no clue why it is so. The default config values may result in degraded performance for users running a cluster with more than 20 nodes. The main problem of HDFS-8188 is the way thread pool is created per target. If it reaches the limit (max mover threads), the remaining pending moves are simply dropped (Or even worse, it hangs without HDFS-11377 ), leading to degraded performance as demonstrated above with graphs. The suggested workaround of "set the mover thread limit to 10,000 or 30,000" simply means removing the limit. i.e. it cannot work with the limit. The suggested improvement calculates the size of each mover thread pool, instead of using the configured fixed value. The total thread count limit is honored without causing the degradation seen with the original design.
          Hide
          kihwal Kihwal Lee added a comment -

          Setting HDFS-11377 as a dependency. In most cases, the proposed change will make it unnecessary, but it is still needed the case where the number of targets is bigger than max number of mover threads. In this case, it won't be possible to serve all targets even by allocating only one thread per target.

          Show
          kihwal Kihwal Lee added a comment - Setting HDFS-11377 as a dependency. In most cases, the proposed change will make it unnecessary, but it is still needed the case where the number of targets is bigger than max number of mover threads. In this case, it won't be possible to serve all targets even by allocating only one thread per target.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 15s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 14m 7s trunk passed
          +1 compile 0m 50s trunk passed
          +1 checkstyle 0m 37s trunk passed
          +1 mvnsite 1m 5s trunk passed
          +1 mvneclipse 0m 14s trunk passed
          -1 findbugs 1m 50s hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings.
          +1 javadoc 0m 44s trunk passed
          +1 mvninstall 1m 1s the patch passed
          +1 compile 0m 56s the patch passed
          +1 javac 0m 56s the patch passed
          -0 checkstyle 0m 36s hadoop-hdfs-project/hadoop-hdfs: The patch generated 3 new + 46 unchanged - 1 fixed = 49 total (was 47)
          +1 mvnsite 1m 2s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 58s the patch passed
          +1 javadoc 0m 41s the patch passed
          -1 unit 71m 3s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 21s The patch does not generate ASF License warnings.
          99m 1s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestMaintenanceState
            hadoop.hdfs.server.namenode.ha.TestPipelinesFailover



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue HDFS-11742
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12866641/HDFS-11742.v2.trunk.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux a0e65c8eb11d 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 4e6bbd0
          Default Java 1.8.0_121
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19332/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19332/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/19332/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19332/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19332/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 15s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 14m 7s trunk passed +1 compile 0m 50s trunk passed +1 checkstyle 0m 37s trunk passed +1 mvnsite 1m 5s trunk passed +1 mvneclipse 0m 14s trunk passed -1 findbugs 1m 50s hadoop-hdfs-project/hadoop-hdfs in trunk has 10 extant Findbugs warnings. +1 javadoc 0m 44s trunk passed +1 mvninstall 1m 1s the patch passed +1 compile 0m 56s the patch passed +1 javac 0m 56s the patch passed -0 checkstyle 0m 36s hadoop-hdfs-project/hadoop-hdfs: The patch generated 3 new + 46 unchanged - 1 fixed = 49 total (was 47) +1 mvnsite 1m 2s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 58s the patch passed +1 javadoc 0m 41s the patch passed -1 unit 71m 3s hadoop-hdfs in the patch failed. +1 asflicense 0m 21s The patch does not generate ASF License warnings. 99m 1s Reason Tests Failed junit tests hadoop.hdfs.TestMaintenanceState   hadoop.hdfs.server.namenode.ha.TestPipelinesFailover Subsystem Report/Notes Docker Image:yetus/hadoop:14b5c93 JIRA Issue HDFS-11742 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12866641/HDFS-11742.v2.trunk.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux a0e65c8eb11d 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 4e6bbd0 Default Java 1.8.0_121 findbugs v3.1.0-RC1 findbugs https://builds.apache.org/job/PreCommit-HDFS-Build/19332/artifact/patchprocess/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs-warnings.html checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/19332/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/19332/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/19332/testReport/ modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs Console output https://builds.apache.org/job/PreCommit-HDFS-Build/19332/console Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... If 2.8.1 is put up for vote with this, I will have to -1 the release.

          Kihwal Lee, we have more understanding of the problem now. Do you think that this is still a blocker of the release?

          Show
          szetszwo Tsz Wo Nicholas Sze added a comment - > ... If 2.8.1 is put up for vote with this, I will have to -1 the release. Kihwal Lee , we have more understanding of the problem now. Do you think that this is still a blocker of the release?
          Hide
          kihwal Kihwal Lee added a comment - - edited

          Do you think that this is still a blocker of the release?

          • HDFS-8818 has problems. It will affect many users, if it is included in a release as is. I will -1 the release if the issue is not properly addressed.
          • I believe this this jira (HDFS-11742) prevents hitting the issue. Thus this jira is a release blocker.
          • IF HDFS-11742 is included alongwith HDFS-8818 and HDFS-11377, I will retract -1 on HDFS-8818. If not, I will maintain -1.

          Do you understand the nature of the issue and how it could affect people? If so, please review the patch.

          Show
          kihwal Kihwal Lee added a comment - - edited Do you think that this is still a blocker of the release? HDFS-8818 has problems. It will affect many users, if it is included in a release as is. I will -1 the release if the issue is not properly addressed. I believe this this jira ( HDFS-11742 ) prevents hitting the issue. Thus this jira is a release blocker. IF HDFS-11742 is included alongwith HDFS-8818 and HDFS-11377 , I will retract -1 on HDFS-8818 . If not, I will maintain -1. Do you understand the nature of the issue and how it could affect people? If so, please review the patch.
          Hide
          szetszwo Tsz Wo Nicholas Sze added a comment -

          > Do you understand the nature of the issue and how it could affect people? ...

          I probably still not. Please give more details.

          BTW, the replaceblockoperationspersec metrics you shown earlier. Is it just for one datanode? Have you checked the other datanodes?

          Show
          szetszwo Tsz Wo Nicholas Sze added a comment - > Do you understand the nature of the issue and how it could affect people? ... I probably still not. Please give more details. BTW, the replaceblockoperationspersec metrics you shown earlier. Is it just for one datanode? Have you checked the other datanodes?
          Hide
          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... If so, please review the patch.

          I actually have looked at the patch. I am not sure it is the right approach since the datanode pairs are sorted by priorities according to the utilization and data locality. The patch tries to schedule the same number of threads to all pairs.

          Show
          szetszwo Tsz Wo Nicholas Sze added a comment - > ... If so, please review the patch. I actually have looked at the patch. I am not sure it is the right approach since the datanode pairs are sorted by priorities according to the utilization and data locality. The patch tries to schedule the same number of threads to all pairs.
          Hide
          szetszwo Tsz Wo Nicholas Sze added a comment - - edited

          > HDFS-8188 has problems. ...

          BTW, the JIRA you referring to should be HDFS-8818, not HDFS-8188. You may want to relax, calm down and then fix the typos.

          Show
          szetszwo Tsz Wo Nicholas Sze added a comment - - edited > HDFS-8188 has problems. ... BTW, the JIRA you referring to should be HDFS-8818 , not HDFS-8188 . You may want to relax, calm down and then fix the typos.
          Hide
          kihwal Kihwal Lee added a comment -

          I probably still not.

          I invite everyone to run Balancer post-HDFS-8818 on their moderately sized clusters with the existing or default settings. This includes you, Tsz Wo Nicholas Sze. I am not talking about quick expansion type of balancing that you seem to focus on, but a steady-state balancing.

          BTW, the replaceblockoperationspersec metrics you shown earlier. Is it just for one datanode? Have you checked the other datanodes?

          This is a aggregate of all nodes.

          I am not sure it is the right approach since the datanode pairs are sorted by priorities according to the utilization and data locality. The patch tries to schedule the same number of threads to all pairs.

          The thread pool creation is per target, not per pair in HDFS-8818 and it tries to assign the same fixed number of threads to each thread pool. Is it not? This does not change in my patch. I am simply adjusting the size of thread pool to not exceed the limit, thus avoiding the skipping problem. Once there are skippings, the throughput can go down.

          Show
          kihwal Kihwal Lee added a comment - I probably still not. I invite everyone to run Balancer post- HDFS-8818 on their moderately sized clusters with the existing or default settings. This includes you, Tsz Wo Nicholas Sze . I am not talking about quick expansion type of balancing that you seem to focus on, but a steady-state balancing. BTW, the replaceblockoperationspersec metrics you shown earlier. Is it just for one datanode? Have you checked the other datanodes? This is a aggregate of all nodes. I am not sure it is the right approach since the datanode pairs are sorted by priorities according to the utilization and data locality. The patch tries to schedule the same number of threads to all pairs. The thread pool creation is per target, not per pair in HDFS-8818 and it tries to assign the same fixed number of threads to each thread pool. Is it not? This does not change in my patch. I am simply adjusting the size of thread pool to not exceed the limit, thus avoiding the skipping problem. Once there are skippings, the throughput can go down.
          Hide
          shv Konstantin Shvachko added a comment -

          Kihwal Lee do you have a graph showing Balancer performance (metric:copyblockoperationspersec) with your patch? For comparison with what you posted earlier.

          Show
          shv Konstantin Shvachko added a comment - Kihwal Lee do you have a graph showing Balancer performance (metric:copyblockoperationspersec) with your patch? For comparison with what you posted earlier.
          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          If 2.8.1 is put up for vote with this, I will have to -1 the release.

          It will affect many users, if it is included in a release as is. I will -1 the release if the issue is not properly addressed.

          I'm pushing for the next 2.8 maint release as well as 2.7.x. Kihwal Lee / Tsz Wo Nicholas Sze, can you please help get ourselves to a convergence? Thanks.

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - If 2.8.1 is put up for vote with this, I will have to -1 the release. It will affect many users, if it is included in a release as is. I will -1 the release if the issue is not properly addressed. I'm pushing for the next 2.8 maint release as well as 2.7.x. Kihwal Lee / Tsz Wo Nicholas Sze , can you please help get ourselves to a convergence? Thanks.
          Hide
          kihwal Kihwal Lee added a comment -

          Konstantin Shvachko, here is a graph.

          HDFS-8188 + HDFS-11377 potentially move blocks faster, if the number of mover thread is jacked up very high. "High" is relative to the size of cluster and subject to the nature of imbalance. In one (~2500 node) of our clusters, setting it to 10,000 wasn't enough. The balancer does create 10,000 threads while only subset of them are utilized. Nicholas previously suggested 30,000 and while that would have "worked", it effectively means HDFS-8188 requires the mover threads limit to be removed.

          What I did here is to honor the configured mover thread limit (default=1,000) and size a thread pool accordingly (#movers / #targets) instead of using a fixed number (default max=50). I've verified it works as good as, and sometimes better than 2.7 balancer with the identical config.

          Show
          kihwal Kihwal Lee added a comment - Konstantin Shvachko , here is a graph. HDFS-8188 + HDFS-11377 potentially move blocks faster, if the number of mover thread is jacked up very high. "High" is relative to the size of cluster and subject to the nature of imbalance. In one (~2500 node) of our clusters, setting it to 10,000 wasn't enough. The balancer does create 10,000 threads while only subset of them are utilized. Nicholas previously suggested 30,000 and while that would have "worked", it effectively means HDFS-8188 requires the mover threads limit to be removed. What I did here is to honor the configured mover thread limit (default=1,000) and size a thread pool accordingly (#movers / #targets) instead of using a fixed number (default max=50). I've verified it works as good as, and sometimes better than 2.7 balancer with the identical config.
          Hide
          shv Konstantin Shvachko added a comment -

          Hi Kihwal Lee, thanks for the data. Shows your point well.

          I have a 2600-node cluster, where we've been running Balancer pretty actively. This was initially based on 2.6, but now pretty close to branch-2.7, and with HDFS-8818 applied. Here is the graph, thanks Erik Krogen for generating it:

          The maximum number of transfers is 50 - close to yours. We do not allow the Balancer to run "full speed" to allow room for normal workload on the cluster. So I am not completely sure that HDFS-8818 is the only thing to blame for the slowness you described. I did not run branch-2.8 Balancer, but I plan to test on branch-2.7 in the very near future.

          Also looked at your patch. Sounds like a reasonable change to me. But may be we want to understand more closely at the root cause of the behavior before we commit?

          Show
          shv Konstantin Shvachko added a comment - Hi Kihwal Lee , thanks for the data. Shows your point well. I have a 2600-node cluster, where we've been running Balancer pretty actively. This was initially based on 2.6, but now pretty close to branch-2.7, and with HDFS-8818 applied. Here is the graph, thanks Erik Krogen for generating it: The maximum number of transfers is 50 - close to yours. We do not allow the Balancer to run "full speed" to allow room for normal workload on the cluster. So I am not completely sure that HDFS-8818 is the only thing to blame for the slowness you described. I did not run branch-2.8 Balancer, but I plan to test on branch-2.7 in the very near future. Also looked at your patch. Sounds like a reasonable change to me. But may be we want to understand more closely at the root cause of the behavior before we commit?
          Hide
          kihwal Kihwal Lee added a comment -

          But may be we want to understand more closely at the root cause of the behavior before we commit?

          HDFS-8818 made a thread pool to be created per target. The size of each thread pool is fixed to dfs.datanode.balance.max.concurrent.moves(default=50). When the number of targets * max concurrent moves exceeds the configured mover thread limit (default=1000), it stops creating thread pools. The end result is that not all calculated moves are executed.

          So when the default config is used, if an iteration involves more than 20 targets (1000/50), some blocks won't be moved. We have clusters where balancing involves hundreds of targets and the effect of this limitation is very visible. My patch dynamically sizes each thread pool based on the number of targets in the iteration and the configured maximum mover threads to mitigate this problem. It does not limit the capability of HDFS-8818. It simply makes it possible to continue to use the default value or adjust it without much side effect.

          Show
          kihwal Kihwal Lee added a comment - But may be we want to understand more closely at the root cause of the behavior before we commit? HDFS-8818 made a thread pool to be created per target. The size of each thread pool is fixed to dfs.datanode.balance.max.concurrent.moves (default=50). When the number of targets * max concurrent moves exceeds the configured mover thread limit (default=1000), it stops creating thread pools. The end result is that not all calculated moves are executed. So when the default config is used, if an iteration involves more than 20 targets (1000/50), some blocks won't be moved. We have clusters where balancing involves hundreds of targets and the effect of this limitation is very visible. My patch dynamically sizes each thread pool based on the number of targets in the iteration and the configured maximum mover threads to mitigate this problem. It does not limit the capability of HDFS-8818 . It simply makes it possible to continue to use the default value or adjust it without much side effect.
          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Kihwal Lee, do you think this is a blocker for 2.7.4 - it's marked as such?

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - Kihwal Lee , do you think this is a blocker for 2.7.4 - it's marked as such?
          Hide
          kihwal Kihwal Lee added a comment -

          do you think this is a blocker for 2.7.4 - it's marked as such?

          It caused performance regression for us. The impact depends on the scale and the nature of imbalance. We would not run the new balancer (now back-ported to 2.7.4) without this.

          Show
          kihwal Kihwal Lee added a comment - do you think this is a blocker for 2.7.4 - it's marked as such? It caused performance regression for us. The impact depends on the scale and the nature of imbalance. We would not run the new balancer (now back-ported to 2.7.4) without this.
          Hide
          shv Konstantin Shvachko added a comment -

          Hey Kihwal Lee, I tested the 2.7.4 Balancer on a small cluster (120 nodes) with and without your patch. I could not reproduce the degradation of performance. The ReplaceBlockOpNumOps were the same in both cases, lower than on the large cluster, but not anything close to the lows as on your graph. Sorry to say this, but I tried. Probably a different type of imbalance or configuration parameters. So I don't see this as a blocker.
          But I also reviewed your patch and found the change reasonable, as I mentioned before. So we can commit it, especially as you feel strong about it. My formal +1 for the patch.
          It is better to commit it promptly. I am waiting on the last blocker HDFS-11896 before starting the RC process.

          Show
          shv Konstantin Shvachko added a comment - Hey Kihwal Lee , I tested the 2.7.4 Balancer on a small cluster (120 nodes) with and without your patch. I could not reproduce the degradation of performance. The ReplaceBlockOpNumOps were the same in both cases, lower than on the large cluster, but not anything close to the lows as on your graph. Sorry to say this, but I tried. Probably a different type of imbalance or configuration parameters. So I don't see this as a blocker. But I also reviewed your patch and found the change reasonable, as I mentioned before. So we can commit it, especially as you feel strong about it. My formal +1 for the patch. It is better to commit it promptly. I am waiting on the last blocker HDFS-11896 before starting the RC process.
          Hide
          kihwal Kihwal Lee added a comment -

          Thanks Konstantin Shvachko for testing and review. By default, there will be 1000 mover threads and 50 of them will be assigned to each move pair. If the cluster has significantly more than 1000/50 = 20 pairs being scheduled per iteration, there will be performance regression.

          I will commit it shortly.

          Show
          kihwal Kihwal Lee added a comment - Thanks Konstantin Shvachko for testing and review. By default, there will be 1000 mover threads and 50 of them will be assigned to each move pair. If the cluster has significantly more than 1000/50 = 20 pairs being scheduled per iteration, there will be performance regression. I will commit it shortly.
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12043 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12043/)
          HDFS-11742. Improve balancer usability after HDFS-8818. Contributed by (kihwal: rev 8e3a992eccff26a7344c3f0e719898fa97706b8c)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Dispatcher.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12043 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12043/ ) HDFS-11742 . Improve balancer usability after HDFS-8818 . Contributed by (kihwal: rev 8e3a992eccff26a7344c3f0e719898fa97706b8c) (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Dispatcher.java
          Hide
          kihwal Kihwal Lee added a comment -

          I've committed this to trunk, branch-2, branch-2.8 and branch-2.7. Did not commit it to branch-2.8.2 as it was missing dependencies. If it doesn't get re-branched, I will pull it in to 2.8.2 with all dependencies.

          Show
          kihwal Kihwal Lee added a comment - I've committed this to trunk, branch-2, branch-2.8 and branch-2.7. Did not commit it to branch-2.8.2 as it was missing dependencies. If it doesn't get re-branched, I will pull it in to 2.8.2 with all dependencies.

            People

            • Assignee:
              kihwal Kihwal Lee
              Reporter:
              kihwal Kihwal Lee
            • Votes:
              0 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development