Thanks a lot for the review, Harsh. Responses to your review inline.
We call the following on every choosing call. While I do not imagine it to be overtly expensive, would it make sense to only do these calls every N times or for after every X total replica size bytes alone (X could be related to the space config we provide for this)?
I doubt seriously this will be a problem at all in practice, so let's hold off on doing so until it does. It wouldn't be tough to do, but I don't think the added code complexity is warranted.
We're using ReflectionUtils with a conf object for initializing the chosen interface's implementation, so the below change to the interface (and associated changes) could perhaps be avoided if your implementation implements Configurable or extends Configured, and overrides setConf(...)?
Real good thinking. Done.
Config names all around are getting longer and longer How 'bout we rename this to be more implementation specific and prefix it dfs.datanode.available.space.volume.choosing.policy.balanced-space- instead of dfs.datanode.fsdataset.volume.choosing.balanced-space-? Just a nit, but I think it then looks more specifically applying to a specific policy.
Good idea. Done.
Why would one not want to have this set as default? If just for initial stability reasons (I think the tests are good), can we have a JIRA to toggle it as default in future?
I thought about changing it to be the default. I didn't for two reasons:
- As you suggest, I think it's a good idea to let this bake a bit before we make it the default all around.
- It's conceivable that one would not want this behavior, since in cases where many concurrent block writes happen on a single node, this policy will skew writes toward disks with more available free space, potentially impacting write throughput.
This latest patch also fixes two bugs I discovered during more manual testing:
- There was a rather narrow race if the free space on a given volume changed during the course of making the volume selection. This is addressed by getting the available space for each volume once upfront and then using that for the duration of the process of choosing.
- There was an issue wherein if more volumes had high available free space than low available free space, the ones with low available free space would actually be allocated more writes per-volume than the others. This is fixed by properly scaling the configured preference percent to account for the size of the two buckets. I also added some more unit tests to verify this is now handled properly.