Thanks for chiming in, Arun.
This JIRA focuses on adding disk scheduling and isolation for local disk read I/O. HDFS short-circuit reads happen to be local-disk reads, so we handle those automatically too.
We shouldn't embed Linux or blkio specific semantics such as proportional weight division into YARN.
The Linux aspects are only for isolation, and this needs to be pluggable.
Wei and I are more familiar with FairScheduler, so we talk about weighted division between queues from that standpoint. We are eager to hear your thoughts on how we should do this with CapacityScheduler, and to augment the configs etc. if need be. I was thinking we would handle it similarly to how it handles CPU today (more on that later).
We need something generic such as bandwidth which can be understood by users, supportable on heterogeneous nodes in the same cluster.
Our initial thinking was along these lines. However, similar to CPU, it gets very hard for a user to specify the bandwidth requirement. It is hard to figure out that my container needs 200 MBps (and 2 GHz CPU). Furthermore, it is hard to enforce bandwidth isolation. When multiple processes are accessing a disk, its aggregate bandwidth could go down significantly. To guarantee bandwidth, I believe the scheduler would have to be very conservative with its allocations.
Given all this, we thought we should probably handle it the way we did CPU. Each process asks for 'n' vdisks to capture the number of disks it needs. To avoid floating point computations, we added an NM config for the available vdisks. Heterogeneity in terms of the number of disks is easily handled with the vdisks-per-node knob. Heterogeneity in each disk's capacity or bandwidth is not handled, similar to our CPU story. I propose we work on this heterogeneity as one of the follow-up items.
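To make the vdisk accounting concrete, here is a rough sketch of how a node might track integer vdisks. This is only an illustration under assumptions: the class VdiskNode and its methods are hypothetical and are not YARN APIs, and the constructor argument merely stands in for the NM config mentioned above.

```java
// Hypothetical sketch of integer vdisk accounting on one NodeManager.
// VdiskNode, tryAllocate, release and available are illustrative names,
// not part of YARN.
final class VdiskNode {
    // Stands in for the NM config that advertises the node's available vdisks.
    private final int totalVdisks;
    private int allocatedVdisks = 0;

    VdiskNode(int totalVdisks) {
        if (totalVdisks < 0) {
            throw new IllegalArgumentException("totalVdisks must be >= 0");
        }
        this.totalVdisks = totalVdisks;
    }

    // Reserves 'requested' vdisks for a container if they fit.
    // Pure integer arithmetic, so no floating point is involved.
    boolean tryAllocate(int requested) {
        if (requested < 0 || requested > totalVdisks - allocatedVdisks) {
            return false;
        }
        allocatedVdisks += requested;
        return true;
    }

    // Returns vdisks to the node when a container finishes.
    void release(int vdisks) {
        if (vdisks < 0 || vdisks > allocatedVdisks) {
            throw new IllegalArgumentException("cannot release " + vdisks);
        }
        allocatedVdisks -= vdisks;
    }

    int available() {
        return totalVdisks - allocatedVdisks;
    }
}
```

In this sketch a node with more disks would simply be constructed with a larger vdisk count, which is how the per-node knob absorbs heterogeneity in the number of disks. Differences in per-disk capacity or bandwidth are not modeled here, matching the limitation described above.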
Spindle locality or I/O parallelism is a real concern
Agreed. Is it okay if we finish this work and follow up with spindle locality? We have some thoughts on how to handle it, but left it out of the doc to keep the design focused.