Hadoop HDFS
HDFS-6010

Make balancer able to balance data among specified servers

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: balancer & mover
    • Labels:

      Description

      Currently, the balancer tool balances data among all datanodes. However, in some particular cases, we would need to balance data only among specified nodes instead of the whole set.

      In this JIRA, a new "-servers" option is introduced to implement this.
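
      As a rough illustration of the proposed interface (a minimal sketch only: the class and method names below are hypothetical and are not taken from the attached patch), a comma-separated "-servers" value could be parsed into a host set and used to decide which datanodes participate in balancing:

          import java.util.HashSet;
          import java.util.Set;

          public class ServerListParser {
            /** Parse a value like "dn1.example.com,dn2.example.com" into a host set. */
            static Set<String> parseServers(String optionValue) {
              Set<String> servers = new HashSet<String>();
              for (String host : optionValue.split(",")) {
                String trimmed = host.trim();
                if (!trimmed.isEmpty()) {
                  servers.add(trimmed);
                }
              }
              return servers;
            }

            /** A datanode participates only if no list was given or its host is listed. */
            static boolean isSelected(Set<String> servers, String datanodeHost) {
              return servers.isEmpty() || servers.contains(datanodeHost);
            }

            public static void main(String[] args) {
              Set<String> servers = parseServers("dn1.example.com, dn2.example.com");
              System.out.println(isSelected(servers, "dn1.example.com")); // true
              System.out.println(isSelected(servers, "dn3.example.com")); // false
            }
          }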

      1. HDFS-6010-trunk_V2.patch
        20 kB
        Yu Li
      2. HDFS-6010-trunk.patch
        16 kB
        Yu Li

        Issue Links

          Activity

          Yu Li added a comment -

          Attaching the first patch against trunk; below is the test-patch result from my local environment:

          -1 overall.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          -1 release audit. The applied patch generated 3624 release audit warnings.

          Yu Li added a comment -

          Hi Devaraj Das,

          Sorry for bothering you, but I noticed you contributed HDFS-2576, and since the patch here is a tool for an IO-isolation solution based on the favored node feature, could you help review it? I've also submitted an RB request here.

          Thanks in advance!

          Devaraj Das added a comment -

          Ok.. will do so Yu Li.

          Yu Li added a comment -

          Hi Devaraj Das,

          Any comments? Or is it ok for me to submit the patch for hadoop QA to test? Thanks.

          Devaraj Das added a comment -

          Sorry, Yu Li, for the delay. I skimmed the patch, but I'd like to have someone from the HDFS side look at it more closely (CC Tsz Wo Nicholas Sze). I am also not clear on how this would be used in practical scenarios. In the favored node case, the thing we were toying with was whether it is possible to have the balancer not disturb the placement of blocks. And if the balancer had to make some undesirable moves, that would be okay: for applications like HBase, compactions would rewrite the data and create the blocks on some set of favored nodes again.

          Tsz Wo Nicholas Sze added a comment -

          > ..., in some particular case, we would need to balance data only among specified nodes instead of the whole set.

          Hi Yu, the use cases are also not clear to me. Could you describe them in more detail?

          Yu Li added a comment -

          Thanks Devaraj Das for the reply and for CC'ing Nicholas!

          Hi Tsz Wo Nicholas Sze, thanks for taking a look here. I found your question similar to Das', so I'd like to answer both in one go. The background is described in HDFS-6009; allow me to quote it here:

          There are scenarios, like those mentioned in HBASE-6721 and HBASE-4210, where in multi-tenant deployments of HBase we prefer to designate several groups of regionservers to serve different applications, to achieve some kind of isolation or resource allocation. However, although the regionservers are grouped, the datanodes that store the data are not, which leads to a single datanode failure affecting multiple applications, as we have already observed in our production environment.

          To relieve the above issue, we could take advantage of the favored node feature (HDFS-2576) to let a regionserver locate its data within its own group, effectively grouping the datanodes (passively) as well, to form some level of isolation.

          In this case, or any other case that requires grouping datanodes, we would need a set of tools to maintain the "group", including:
          1. Making the balancer able to balance data among specified servers, rather than the whole set
          2. Setting the balance bandwidth for specified servers, rather than the whole set
          3. A tool to check whether a block is placed "cross-group", and move it back if so

          People might ask why we don't use physically separate clusters in this case; the answer is that managing one big cluster is more convenient and saves operational effort compared to managing several small ones.

          I also know there are other solutions, like HDFS-5776, to reduce the impact of a bad datanode, but I believe there are still scenarios that need stricter IO isolation, so I think it is still valuable to contribute our tools.

          For undesirable moves caused by HBase compaction-like operations, or by re-replication after disk damage, we could supply a tool like the one described in HDFS-6012 to check for and move "cross-group" blocks back.

          Let me know if you have any comments.

          Yu Li added a comment -

          Hi Tsz Wo Nicholas Sze,

          What do you think about the use case? Does it make sense to you? If so, is it OK for me to submit the patch for Hadoop QA to test? Thanks.

          Tsz Wo Nicholas Sze added a comment -

          Hi Yu, it does make sense to me, although I am not familiar with HBase deployment. Thanks for clarifying it.

          Devaraj Das, do you have any comment?

          Yu Li added a comment -

          Hi Devaraj Das, it seems we are waiting for your comment here.

          Tsz Wo Nicholas Sze, do you have any review comments on the patch attached here? Or do we need to wait for Das' comments before starting the code review? Thanks.

          Devaraj Das added a comment -

          Yu Li, sorry for the delay in getting back. You know how things work when there are deadlines to meet. I have some follow-up questions for my understanding.

          1. How would you maintain the mapping of files to groups (for HDFS-6012 to work)? If the mapping is maintained, I am wondering whether it makes sense to have the tool take paths for balancing as opposed to servers. Then maybe you could also combine the tool that does group management (HDFS-6012) into the balancer.
          2. Are these mappings set up by some admin?
          3. Would you expand a group when it is nearing capacity?
          4. How does someone like HBase use this? Is HBase going to have visibility into the mappings as well (to take care of HBASE-6721 and favored-nodes for writes)?
          5. Would you need a higher level balancer for keeping the whole cluster balanced (do migrations of blocks associated with certain paths from one group to another)? Otherwise, there would be skews in the block distribution.
          6. When there is a failure of a datanode in a group, how would you choose which datanodes to replicate the blocks to? The choice would be somewhat important given that some target datanodes might be busy serving requests for apps in their group. Adding more work to these datanodes might make apps in the other group suffer. But maybe it's not that big a deal. On the other hand, if the group still has capacity, and the failure zones are still intact for the members of the group, then the replication could take the mapping in (1) into account.

          Yu Li added a comment -

          You know how things work when there are deadlines to meet

          Totally understand, no problem


          1. How would you maintain the mapping of files to groups?

          We don't maintain the mapping in HDFS; instead we use the regionserver group information. In other words, in our use case this tool is used along with the regionserver group feature: the admin can get the RS group information through an hbase shell command and pass the server list to the balancer. To make it easier, we actually wrote a simple script that does the whole process, so the admin only needs to enter an RS group name for data balancing. For more details, please refer to the answer to question #4.

          wondering whether it makes sense to have the tool take paths for balancing as opposed to servers

          In our HBase use case this is OK, but I think it might be better to make the tool more general. There might be other scenarios requiring balancing data among a subset instead of the full set of datanodes, although I cannot give one for now.


          2. Are these mappings set up by some admin?

          Yes, as described in the comments above.


          3. Would you expand a group when it is nearing capacity?

          Yes. We could change the setting of an RS group, e.g. moving one RS from groupA to groupB; then we would need to use the HDFS-6012 tool to move blocks so as to assure "group-block-locality". We'll come back to this topic in the answer to question #5.


          4. How does someone like HBase use this? Is HBase going to have visibility into the mappings as well (to take care of HBASE-6721 and favored-nodes for writes)?

          Yes. Through HBASE-6721 (actually we have done quite a few improvements to it to make it simpler and more suitable for our production environment, but that's another topic and won't be discussed here) we can group RSes to supply a multi-tenant service: one application uses one RS group (regions of all tables of this application are served only by RSes in its own group), and writes data to the mapped DNs through the favored-node feature. To be more specific, it's an "app-regionserverGroup-datanodeGroup" mapping; all HFiles of an application's tables are located only on the DNs of its RS group.


          5. Would you need a higher level balancer for keeping the whole cluster balanced (do migrations of blocks associated with certain paths from one group to another)? Otherwise, there would be skews in the block distribution.

          You really have got the point here. Actually, the biggest downside of this solution for IO isolation is that it causes data imbalance from the viewpoint of the whole HDFS cluster. In our use case, we recommend that the admin not run the balancer over all DNs. Instead, as mentioned in the answer to question #3, if we find one group with high disk usage while another group is relatively "empty", the admin can reset the groups to move an RS/DN server around. The HDFS-6010 tool plus the HDFS-6012 tool make this work.


          6. When there is a failure of a datanode in a group, how would you choose which datanodes to replicate the blocks to? The choice would be somewhat important given that some target datanodes might be busy serving requests

          Currently we don't control re-replication after a datanode failure, but use the HDFS default policy. So the only impact a datanode failure has on isolation is that blocks might be replicated outside the group; that's why we need the HDFS-6012 tool to periodically check for and move "cross-group" blocks back.


          Devaraj Das, I hope the above comments answer your questions; feel free to let me know if you have any further comments.

          Devaraj Das added a comment -

          Sanjay Radia, could you please take a look at the proposal here.

          Yu Li added a comment -

          Hi Tsz Wo Nicholas Sze, Sanjay Radia and Devaraj Das,

          Is it ok for me to submit the patch? Or any more review comments?

          Devaraj Das added a comment -

          Go ahead and submit, Yu.

          Yu Li added a comment -

          Submitting patch for hadoop QA to test.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12630917/HDFS-6010-trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.balancer.TestBalancer

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6425//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6425//console

          This message is automatically generated.

          Yu Li added a comment -

          The UT failure is caused by a bug in TestBalancer; here is the detailed analysis:

          Let's look at the code logic of testUnevenDistribution: if the number of datanodes in the mini-cluster is 3 (or larger), the replication factor will be set to 2 (or more), and generateBlocks will generate a file with it, so the block count equals (targetSize/replicationFactor)/blockSize. Then distributeBlock will multiply the block count by the replication factor through the code below:

              for(int i=0; i<blocks.length; i++) {
                for(int j=0; j<replicationFactor; j++) {
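                  // nothing here prevents the same datanode from being chosen for
                  // two replicas of blocks[i]; duplicates are possible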
                  boolean notChosen = true;
                  while(notChosen) {
                    int chosenIndex = r.nextInt(usedSpace.length);
                    if( usedSpace[chosenIndex]>0 ) {
                      notChosen = false;
                      blockReports.get(chosenIndex).add(blocks[i].getLocalBlock());
                      usedSpace[chosenIndex] -= blocks[i].getNumBytes();
                    }
                  }
                }
              }
          

          Notice that this distribution cannot prevent replicas of the same block from landing on the same datanode. Then, when the MiniDFSCluster#injectBlocks (actually SimulatedFSDataset#injectBlocks) method is invoked, the duplicated blocks get dropped according to the code segment below:

          SimulatedFSDataset#injectBlocks
            public synchronized void injectBlocks(String bpid,
                Iterable<Block> injectBlocks) throws IOException {
              ExtendedBlock blk = new ExtendedBlock();
              if (injectBlocks != null) {
                for (Block b: injectBlocks) { // if any blocks in list is bad, reject list
                  if (b == null) {
                    throw new NullPointerException("Null blocks in block list");
                  }
                  blk.set(bpid, b);
                  if (isValidBlock(blk)) {
                    throw new IOException("Block already exists in  block list");
                  }
                }
                Map<Block, BInfo> map = blockMap.get(bpid);
                if (map == null) {
                  map = new HashMap<Block, BInfo>();
                  blockMap.put(bpid, map);
                }
                for (Block b: injectBlocks) {
                  BInfo binfo = new BInfo(bpid, b, false);
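                  // the map is keyed by the block itself, so if the injected list
                  // holds the same block twice (two replicas placed on this node),
                  // the second put overwrites the first and one replica is lost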
                  map.put(binfo.theBlock, binfo);
                }
              }
            }
          

          This causes the used space to be less than expected and thus causes the test failure. The issue was hidden because in the existing tests the datanode count was never set larger than 2. It is easy to reproduce the issue simply by increasing the datanode count of TestBalancer#testBalancer1Internal from 2 to 3, like so:

          TestBalancer#testBalancer1Internal
            void testBalancer1Internal(Configuration conf) throws Exception {
              initConf(conf);
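              // with three datanodes the test uses replication factor 2, which
              // triggers the duplicate-replica problem described above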
              testUnevenDistribution(conf,
                  new long[] {90*CAPACITY/100, 50*CAPACITY/100, 10*CAPACITY/100},
                  new long[] {CAPACITY, CAPACITY, CAPACITY},
                  new String[] {RACK0, RACK1, RACK2});
            }
          

          I've tried to refine the distribution method, but I found it hard to make general. To make sure no duplicated blocks are assigned to the same datanode, we must make sure the largest distribution is less than the sum of the other distributions (with replication factor 2, every replica placed on the biggest node must be paired with a replica of the same block on some other node, so the biggest node cannot hold more than all the others combined).

          On second thought, I don't even think it is necessary to involve the replication factor in the balancer testing. Maybe the UT designer intended to test balancer behavior while replication is also ongoing, but unfortunately the current design cannot reveal this. So, personally, I propose always setting the replication factor to 1 in TestBalancer.

          Yu Li added a comment -

          Attaching the new patch with a fix for the UT failure mentioned above, and resubmitting the patch for Hadoop QA to test.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12636334/HDFS-6010-trunk_V2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6474//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6474//console

          This message is automatically generated.

          Yu Li added a comment -

          Hi Tsz Wo Nicholas Sze,

          Since the Hadoop QA test has passed, could you please help review and commit this patch? It introduces a new class, NodeStringValidator.java, to validate whether a given string identifies a valid datanode; both HDFS-6011 and HDFS-6012 depend on it. I could upload the patches for the other two JIRAs right after this one is committed, and thus finish contributing the whole tool set mentioned in HDFS-6009. Thanks!

          Tsz Wo Nicholas Sze added a comment -

          Will do. I should be able to review the patch later this week or early next week.

          Yu Li added a comment -

          Ok, thanks in advance Tsz Wo Nicholas Sze

          Tsz Wo Nicholas Sze added a comment -

          The patch is generally good. Some comments:

          • I think "-datanodes" may be a better name than "-servers". However, I actually suggest not adding it as a CLI parameter since, for a large cluster, it may not be easy to specify all the selected datanodes in CLI. How about adding a new conf property, say dfs.balancer.selectedDatanodes?
          • The new class NodeStringValidator is unlikely to be used outside Balancer. How about moving it to the balancer package and renaming it to BalancerUtil?
          • In initNodes(..), if target == null, it will throw an IllegalArgumentException. However, a balancer may run for a long time and some datanodes could be down. I think we should not throw exceptions. Perhaps, printing a warning is good enough.
            • The new code could be moved to a static method (in BalancerUtil) so that it is easier to read.

          I have not yet checked NodeStringValidator and the new tests in detail.
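
          For illustration only, a configuration-based variant might read the selected datanodes roughly as in the sketch below; the property name is just the suggestion above, not an existing HDFS key, and nothing here is taken from the attached patch:

            import java.util.Arrays;
            import java.util.HashSet;
            import java.util.Set;
            import org.apache.hadoop.conf.Configuration;

            public class SelectedDatanodes {
              // hypothetical property name, following the suggestion above
              static final String SELECTED_DATANODES_KEY = "dfs.balancer.selectedDatanodes";

              /** Returns an empty set (meaning "all datanodes") when the key is unset. */
              static Set<String> fromConf(Configuration conf) {
                // getTrimmedStrings returns an empty array for an unset key
                return new HashSet<String>(
                    Arrays.asList(conf.getTrimmedStrings(SELECTED_DATANODES_KEY)));
              }
            }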

          Yu Li added a comment -

          Thanks for the review and comments Tsz.

          I think "-datanodes" may be a better name than "-servers"...How about adding a new conf property, say dfs.balancer.selectedDatanodes?

          IMHO, by making it a CLI option, the user can dynamically choose which nodes to balance among, while a property is static. In our use case, the admin might balance groupA and groupB separately, and a CLI option would make that easier, right?
          I agree to rename the option to "-datanodes" if we decide to keep it as a CLI option.

          How about moving it to the balancer package and renaming it to BalancerUtil?

          I agree to move it to the balancer package. About the name: since it currently only validates whether a given string matches a live datanode, the name "BalancerUtil" seems too broad to me.

          a balancer may run for a long time and some datanodes could be down. I think we should not throw exceptions. Perhaps, printing a warning is good enough

          It's true that some datanodes could be down, but I'd like to discuss this scenario a bit more. Assume groupA has 3 nodes and node #1 is down. When the admin issues a command like "-datanodes 1,2,3", he means to balance the data distribution across the 3 nodes. If we only print warnings, the balancer will first balance data between nodes #2 and #3, and then, after node #1 comes back, the admin has to do another round of balancing. Since each balancing round adds a read lock to the involved blocks and causes disk/network IO, in our production environment we would prefer to fail the first attempt and wait until all datanodes are back. So I'd like to ask for a second thought on whether to throw an exception or print a warning here.

          The new code could be moved to a static method (in BalancerUtil) so that it is easier to read.

          Agreed; I will refine the code regardless of whether we change from throwing an exception to printing a warning.

          I have not yet checked NodeStringValidator and the new tests in detail

          No problem; I will wait for your comments and update the patch in one go, along with all the changes required from the above discussion.

          Tsz Wo Nicholas Sze added a comment -

          Yu Li, have you looked at HDFS-6441?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12636334/HDFS-6010-trunk_V2.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7343//console

          This message is automatically generated.

          Arpit Agarwal added a comment -

          I am resolving this as a duplicate of HDFS-6441 since the changes were incorporated. I co-credited Yu Li for HDFS-6441.


            People

            • Assignee:
              Yu Li
              Reporter:
              Yu Li
            • Votes:
              0
              Watchers:
              8
