Andrew Wang I really appreciate you taking time out to read the proposal and provide such detailed feedback.
I really like separating planning from execution, since it'll make the unit tests actual unit tests (no minicluster!). This is a real issue with the existing balancer, the tests take forever to run and don't converge reliably.
Thank you, that was the intent.
HDFS-1804 has been around for a while and used successfully by many of our users, so "lack of real world data or adoption" is not entirely correct. We've even considered making it the default, see HDFS-8538 where the consensus was that we could do this if we add some additional throttling.
Thanks for correcting me and reference to HDFS-8538. I know folks at Cloudera write some excellent engineering blog posts and papers, I would love to read your experiences with using
IMO the usecase to focus on is the addition of fresh drives, particularly in the context of hotswap. I'm unconvinced that intra-node imbalance happens naturally when
HDFS-1804 is enabling, and enabling HDFS-1804 is essential if a cluster is commonly suffering from intra-DN imbalance (e.g. from differently sized disks on a node). This means we should only see intra-node imbalance on admin action like adding a new drive; a singular, administrator-triggered operation.
Our own experience from the field is that many customers routinely run into this issue, with and without new drives being added, so we are tackling both use cases.
However, I don't understand the need for cluster-wide reporting and orchestration.
Just to make sure we are in same page, there is no cluster-wide orchestration. The cluster-wide reporting allows admins to see which nodes need disk balancing. Not all clusters are running
HDFS-1804, and hence the ability to discover which machines need disk-balancing is useful for a large set of customers.
I think most of this functionality should live in the DN since it's better equipped to do IO throttling and mutual exclusion.
That is how it is, HDFS-1312 is not complete and you will see all the data movement is indeed inside the datanode.
Namely, this avoids the Discover step, simplifying things. There's also no global planning step. What I'm envisioning is a user experience where admin just points it at a DN, like:
hdfs balancer -volumes -datanode 126.96.36.199:50070
[Took the liberty to re-order some of your comments since both of these are best answered together]
Completely agree with the proposed command, in fact that is one of the commands that will be part of the tool. As you pointed out that is the simplest use case.
The reason for discovery is due to many considerations.
- This discover phase is needed even if we have to process a single node, we have to read the information about that node. I understand that you are pointing out that we may not need the info for the whole cluster, please see my next point.
- Discover phase makes coding simpler since we are able to rely on current balancer code in NameNodeConnector, and avoids us having to write any new code. For the approach that you are suggesting we will have to add more RPCs to datanode and for a rarely used administrative tool it seemed like an overkill, when balancer already provided a way for us to do it. if you look at current code in HDFS-1312, you will see that discover is merely a pass-thru to the Balancer#NameNodeConnector.
- With discover approach, we can take a snapshot of the nodes and cluster, that allows us to report to the admin what changes were done by us after the move. This is pretty useful. Since we have cluster-wide data with us it also allows us to report to an administrator which nodes need his/her attention (this is just a sort and print). As I mentioned earlier, unfortunately there is a large number of customers that still do not use
HDFS-1804 and it is a beneficial feature for them.
- Last but most important from my personal point of view, this allows testing and error reporting for disk balancer to much more easier. Let us say we find a bug in disk-balancer, a customer could just provide the disk-balancer cluster json discovered by diskbalancer and we can debug the issue off-line. During the the development phase, that is how I have been testing disk balancer, by taking a snapshot of real clusters and then feeding that data back into disk-balancer via a connector called JsonNodeConnector.
As I said earlier, the most generic use case I have in mind is the one you already described, where the user wants to just point the balancer at a datanode, and we will support that use case.
There's also no global planning step, I think most of this functionality should live in the DN since it's better equipped to do IO throttling and mutual exclusion. Basically we'd send an RPC to tell the DN to balance itself (with parameters), and then poll another RPC to watch the status and wait until it's done.
We have to have a plan – either we do that inside the datanode or do it outside and submit the plan to the datanode. With doing it outside we are able to test the planning phase independently. Please look at the test cases in TestPlanner.java we are able to test for both scale and correctness of the planner easily, with moving that into datanode we will get into the issue that we can only test a plan with miniDFSCluster, which we both agree is painful.
This also has an added benefit that an admin gets an opportunity to review the plan if needed, and supports the other use case where you can use this tool as block mover. Eventually we are hoping that this code will merge with mover.
On the topic of the actual balancing, how do we atomically move the block in the presence of failures? Right now the NN expects only one replica per DN, so if the same replica is on multiple volumes of the DN, we could run into issues. See related (quite serious) issues like
HDFS-7443 and HDFS-7960. I think we can do some tricks with a temp filename and rename, but this procedure should be carefully explained.
Thanks for the pointers. We have very little new code here, we rely on the mover logic, in fact what we have a thin wrapper over FsDatasetSpi#moveBlockAcrossStorage. But the concerns you raise are quite valid, and I will test to make sure that we don't regress on the data node side. We will cover all these scenarios and build some test cases specifically to replicate the error conditions that you are pointing out. Right now, all we do is take a lock at the FsDatasetSpi level (since we want mutual exclusion with DirectoryScanner) and rely on moveBlockAcrossStorage. I will also add more details to Architecture_and_testplan.pdf, if it is not well explained in that document.