Hadoop YARN / YARN-2139

[Umbrella] Support for Disk as a Resource in YARN

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      YARN should consider disk as another resource for (1) scheduling tasks on nodes, (2) isolation at runtime, (3) spindle locality.

      1. Disk_IO_Isolation_Scheduling_3.pdf
        245 kB
        Karthik Kambatla
      2. Disk_IO_Scheduling_Design_1.pdf
        176 kB
        Wei Yan
      3. Disk_IO_Scheduling_Design_2.pdf
        227 kB
        Wei Yan
      4. YARN-2139-prototype.patch
        370 kB
        Wei Yan
      5. YARN-2139-prototype-2.patch
        327 kB
        Wei Yan

        Issue Links

          Activity

          stevel@apache.org Steve Loughran added a comment -

          I have looked a bit at this.

          Cgroups can do IO throttling, but that's just local IO, and doesn't cover HDFS IO, as that happens in the HDFS process, and some of it can even be happening over the net on remote hosts.

          You could do some co-operative throttling of IO clients in the hdfs/fs library: the client itself does some moving-average throttling.

          Otherwise, you could perhaps include some IO requirements in a container request (light, medium, heavy) and include that in placement.
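The co-operative client-side throttling idea above can be sketched as follows. This is illustrative only: the class and method names are invented for this sketch and are not HDFS or fs-library APIs.

```python
import time

class MovingAverageThrottle:
    """Cooperatively throttles a client's own IO to a target bytes/sec.

    Illustrative sketch only -- not an actual HDFS/fs-library API. The
    client tracks an exponentially weighted moving average (EWMA) of its
    throughput and sleeps when it runs ahead of the configured rate.
    """

    def __init__(self, target_bytes_per_sec, alpha=0.3):
        self.target = target_bytes_per_sec
        self.alpha = alpha          # EWMA smoothing factor
        self.rate_estimate = 0.0    # bytes/sec moving average
        self.last_ts = time.monotonic()

    def account(self, nbytes):
        """Record nbytes of IO; sleep if the moving average exceeds target."""
        now = time.monotonic()
        elapsed = max(now - self.last_ts, 1e-6)
        instant_rate = nbytes / elapsed
        self.rate_estimate = (self.alpha * instant_rate
                              + (1 - self.alpha) * self.rate_estimate)
        self.last_ts = now
        if self.rate_estimate > self.target:
            # Sleep long enough to pull the average back toward the target,
            # bounded so a burst never stalls the client for long.
            excess = self.rate_estimate - self.target
            time.sleep(min(excess / self.target * elapsed, 1.0))

# Usage: the client calls account() after each read/write it performs.
throttle = MovingAverageThrottle(target_bytes_per_sec=64 * 1024 * 1024)
for _ in range(4):
    # data = stream.read(8 * 1024 * 1024)  # real IO would happen here
    throttle.account(8 * 1024 * 1024)
```

Because the throttling is cooperative, it works regardless of where the bytes physically move (local disk, loopback, or remote datanode), which is exactly why it sidesteps the cgroup limitation described above.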

          ywskycn Wei Yan added a comment -

          For disk IO, we can use cgroups to limit the IO usage of each local container. HDFS remote reads/writes can be treated as network IO; we can also limit the network bandwidth for each container (YARN-2140).

          Otherwise, you could perhaps include some IO requirements in a container request (light, medium, heavy) and include that in placement.

          Yes, putting some fields in the container request is good. This can help avoid placing two IO-intensive tasks together.
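The cgroup-based limiting mentioned above is proportional. A minimal sketch of mapping per-container disk shares onto cgroup v1 blkio.weight values (valid range 100 to 1000 under the CFQ scheduler); the mapping and all names here are illustrative, not the NodeManager's actual implementation:

```python
def blkio_weights(vdisk_shares, min_w=100, max_w=1000):
    """Map per-container vdisk shares to cgroup v1 blkio.weight values.

    Sketch of the proportional-share idea; assumes blkio.weight's 100-1000
    range. 'vdisk_shares' maps container id -> requested share (a number).
    """
    largest = max(vdisk_shares.values())
    weights = {}
    for container, share in vdisk_shares.items():
        # Scale linearly so the largest share gets max_w; clamp to the
        # kernel-accepted range so small shares stay valid.
        w = int(max_w * share / largest)
        weights[container] = max(min_w, min(max_w, w))
    return weights

print(blkio_weights({"container_01": 1, "container_02": 2, "container_03": 4}))
# {'container_01': 250, 'container_02': 500, 'container_03': 1000}
```

The NM would then write each weight to that container's blkio cgroup; the kernel enforces the proportions only when the disk is contended, which matches the fair-share semantics discussed later in this thread.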

          stevel@apache.org Steve Loughran added a comment -

          I don't think I was clear enough, sorry.

          HDFS disk IO takes place in the datanode, not in any YARN container. Accordingly, cgroup throttling of container disk IO has no effect on HDFS bandwidth/IOPS. Network throttling could, if it limited loopback and remote network load, but anything over the unix-domain-sockets bypass isn't going over the TCP stack.

          sandyr Sandy Ryza added a comment -

          With short-circuit local reads, reads on local data do occur inside the YARN container process. This still won't help with remote reads of course.

          tucu00 Alejandro Abdelnur added a comment -

          As Sandy says, short-circuit kicks in for local HDFS reads, making them local FS access like any other, so the block IO controller would apply. For remote HDFS reads, the network IO controller would kick in. So effectively we can control the resources the container uses.

          stevel@apache.org Steve Loughran added a comment -

          Well, I am happy to be wrong here. I was of the belief that even with the domain sockets, it was the DN process doing the disk IO.

          Before writing a line of code, however, why not spend time experimenting with ionice to verify:

          1. that ionicing a container will throttle the HDFS access it makes
          2. that ionicing the datanode does not throttle its access

          It'd be good to have some tests set up to verify that this all works anyway, so really I'm advocating tests first.

          ywskycn Wei Yan added a comment -

          Attaching a design draft.

          stevel@apache.org Steve Loughran added a comment -

          Looks like a good first draft.

          1. Unless this does address HDFS, call out that this is local disk IO.
          2. This is really disk IO bandwidth, so it should use an option like vlocaldiskIObandwidth. This will avoid confusion (make clear it's not HDFS) and add scope for the addition of future options: IOPS and actual allocation of entire disks to containers.
          3. What's the testability of this feature?
          ywskycn Wei Yan added a comment -

          Thanks for the comments, Steve Loughran.
          For the HDFS read/write problem you mentioned, we leave it to the network part, as we also need to handle the HDFS replication traffic. I agree that we should avoid confusion with the HDFS "fs".

          The idea of vdisks follows vcores, where each physical CPU core is measured as some number of vcores. One concern about using real numbers is that users cannot specify their task requirements easily. One way to solve that is to provide several levels (low, moderate, high, etc.) instead of real numbers. This is also similar to the YARN-1024 discussions on how to measure CPU capacity. We can define how much IOPS/bandwidth maps to 1 vdisk.

          For testability, currently I have: (1) for fair share, start several tasks with the same operations, place them on a single node, and check whether their I/O performance follows fair sharing; (2) for I/O performance isolation of a given task, in a fully loaded cluster, replay the given task several times and verify that its I/O performance is stable. Here the task can do lots of local disk reads and direct writes, so that most of its time is spent on I/O.
          Any good testing ideas?

          kkambatl Karthik Kambatla (Inactive) added a comment -

          This is really disk IO bandwidth, so it should use an option like vlocaldiskIObandwidth. This will avoid confusion (make clear it's not HDFS) and add scope for the addition of future options: IOPS and actual allocation of entire disks to containers.

          Good point. The document should probably discuss this in more detail. I think we should separate out the resource model used for requests and scheduling from the way we enforce it.

          For the former, I believe vdisks is a good candidate. Users find it hard to specify disk IO requirements in terms of IOPS and bandwidth; e.g., my MR task needs 200 MBps. vdisks, on the other hand, represent a share of the node and the IO parallelism (in a somewhat vague sense) the task can make use of. Furthermore, it is hard to guarantee a particular bandwidth or performance, as they depend on the amount of parallelism and the degree of randomness of the disk accesses.

          That said, I see value in making the enforcement pluggable. This JIRA could add the cgroups-based disk-share enforcement. In the future, we could explore other options.

          ywskycn Wei Yan added a comment -

          Updated the design doc to include spindle-locality information. Comments are very welcome.
          I'll create the sub-tasks to upload prelim code for review soon.

          acmurthy Arun C Murthy added a comment -

          Wei Yan - thanks for the design doc, it's well put together.

          Some feedback:

          1. We shouldn't embed Linux or blkio-specific semantics such as proportional weight division into YARN. We need something generic such as bandwidth, which can be understood by users, supportable on heterogeneous nodes in the same cluster, and supportable on other platforms like Windows.
          2. Spindle locality or I/O parallelism is a real concern - we should probably support bandwidth and spindles.
          3. Spindle locality or I/O parallelism cannot be tied to HDFS. In fact, YARN should not have a dependency on HDFS at all (smile)! This is particularly important in light of developments like Kafka-on-YARN (KAFKA-1754) because people want to use YARN to deploy only Kafka & Storm etc. YARN-2817 helps in this regard.

          Makes sense?

          ywskycn Wei Yan added a comment -

          Thanks for the comments, Arun C Murthy.

          1. We shouldn't embed Linux or blkio-specific semantics such as proportional weight division into YARN. We need something generic such as bandwidth, which can be understood by users, supportable on heterogeneous nodes in the same cluster, and supportable on other platforms like Windows.

          One concern with bandwidth is that it may be hard for users to specify the bandwidth requirement when submitting resource requests.

          2. Spindle locality or I/O parallelism is a real concern. 3. Spindle locality or I/O parallelism cannot be tied to HDFS.

          Yes, I agree that spindle locality is very important. I'll look into KAFKA-1754 to rethink the spindle design.

          kasha Karthik Kambatla added a comment - - edited

          Thanks for chiming in, Arun.

          This JIRA focuses on adding disk scheduling, and isolation for local disk read I/O. HDFS short-circuit reads happen to be local-disk reads, and hence we handle that too automatically.

          We shouldn't embed Linux or blkio specific semantics such as proportional weight division into YARN.

          The Linux aspects are only for isolation, and this needs to be pluggable.

          Wei and I are more familiar with FairScheduler, and talk about weighted division between queues from that standpoint. We are eager to hear your thoughts on how we should do this with CapacityScheduler, and augment the configs etc. if need be. I was thinking we would handle it similar to how it handles CPU today (more on that later).

          We need something generic such as bandwidth which can be understood by users, supportable on heterogenous nodes in the same cluster

          Our initial thinking was along these lines. However, similar to CPU, it gets very hard for a user to specify the bandwidth requirement: it is hard to figure out that my container needs 200 MBps (and 2 GHz of CPU). Furthermore, it is hard to enforce bandwidth isolation. When multiple processes are accessing a disk, its aggregate bandwidth could go down significantly. To guarantee bandwidth, I believe the scheduler has to be super conservative with its allocations.

          Given all this, we thought we should probably handle it the way we did CPU. Each process asks for 'n' vdisks to capture the number of disks it needs. To avoid floating point computations, we added an NM config for the available vdisks. Heterogeneity in terms of number of disks is easily handled with vdisks-per-node knob. Heterogeneity in each disk's capacity or bandwidth is not handled, similar to our CPU story. I propose we work on this heterogeneity as one of the follow-up items.

          Spindle locality or I/O parallelism is a real concern

          Agree. Is it okay if we finish this work and follow-up with spindle-locality? We have some thoughts on how to handle it, but left it out of the doc to keep the design focused.

          ywskycn Wei Yan added a comment -

          I submitted a prototype to illustrate the basic design and implementation.
          The code changes fall into three major parts:
          (1) API: add vdisks as a third resource type, besides CPU/memory. The NM will specify its own vdisks resource, and the AM includes vdisks in the resource request.
          (2) Scheduler: the scheduler will consider vdisks availability when scheduling. Additionally, the DRF policy also considers vdisks when choosing the dominant resource.
          (3) I/O isolation: implemented on the NM side, using cgroups' blkio subsystem for container I/O isolation.

          Will separate the patch into several sub-task patches once we collect more comments on the design and implementation.
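Part (2) extends Dominant Resource Fairness to a third resource. The dominant-share computation it implies can be sketched like this; all numbers and names are illustrative, not taken from the patch:

```python
def dominant_share(demand, capacity):
    """Return (dominant resource, share) for one container under DRF.

    With vdisks added as a third resource, the dominant resource is simply
    whichever of the three has the largest demand/capacity ratio.
    """
    shares = {r: demand[r] / capacity[r] for r in demand}
    resource = max(shares, key=shares.get)
    return resource, shares[resource]

# Illustrative node capacity and container request.
node = {"memory_mb": 65536, "vcores": 16, "vdisks": 8}
task = {"memory_mb": 4096, "vcores": 1, "vdisks": 2}
# vdisks dominates here: 2/8 = 0.25 > 4096/65536 = 1/16 = 0.0625
print(dominant_share(task, node))
```

An IO-heavy task like this one would be scheduled (and preempted) based on its vdisks share rather than its memory or CPU share, which is the behavioral change the prototype introduces in the DRF policy.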

          ywskycn Wei Yan added a comment -

          This prototype patch is posted to garner feedback. The spindle-locality part has not been finished; I will post it once updated.

          kasha Karthik Kambatla added a comment -

          Thanks for the prototype, Wei. In light of the updates on YARN-2791 and YARN-2817, I propose we incorporate suggestions from Swapnil Daingade and Arun C Murthy before posting patches for sub-tasks.

          Updated JIRA title, description, and marked it unassigned as this is an umbrella JIRA.

          acmurthy Arun C Murthy added a comment -

          Sorry, been busy with 2.6.0 - just coming up for air.

          What are we modeling with vdisk again? What is the metric? Is it directly the blkio parameter? If so, that is my biggest concern.

          kasha Karthik Kambatla added a comment -

          It is very similar to vcores. vdisks is the number of virtual disks: no metric, just a number.

          If we want to allow up to 'n' tasks to share a disk, vdisks = n * num-disks. For cases with n > 1, spindle locality will help ensure that all 'n' vdisks correspond to the same spindle(s).
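The vdisks = n * num-disks rule keeps the scheduler's math in integers. A worked example (all values illustrative):

```python
def node_vdisks(num_disks, n):
    """vdisks = n * num-disks: up to n tasks may share each physical disk."""
    return n * num_disks

capacity = node_vdisks(num_disks=12, n=2)  # a 12-disk node advertises 24 vdisks
containers = capacity // 3                 # containers requesting 3 vdisks each
print(capacity, containers)                # 24 8
```

So a container asking for 1 vdisk gets a 1/n share of one spindle, and no floating-point arithmetic is needed anywhere in the allocation path.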

          leftnoteasy Wangda Tan added a comment -

          Thanks Wei Yan for the design doc and prototype.

          I have a similar feeling to what Arun C Murthy commented: the disk resource is a little different from vcores. CPU is a shared resource; processes/threads can occupy CPU cores and can easily be switched to other cores. But disk is not (in spite of RAID): if a process writes to a file on a local disk (like Kafka), you cannot easily switch the file being written to another disk.

          Also, we need to consider that if multiple containers are scheduled to the same physical disk, the total bandwidth of these containers may drop very fast.

          So I think scheduling for disks is more like affinity to disks (like giving disk #1, #2, #4 to the container) instead of just limiting the number of processes on each node.

          Any thoughts? Please feel free to correct me if I was wrong.

          Thanks,
          Wangda

          kasha Karthik Kambatla added a comment -

          Wangda Tan - completely agree with both Arun and you on the spindle-locality/affinity front. The design doc hints at it but doesn't cover it in as much detail as it should. I am all for accomplishing that here too; I can work on fleshing out the locality-affinity pieces as we start getting the remaining parts in.

          I am considering starting the development on a feature-branch so we have a chance to change things before merging into trunk and branch-2. Are people okay with that?

          bikassaha Bikas Saha added a comment -

          Given that this design and possible implementation might go through unstable rounds and are currently not abstracted enough in the core code, doing this on a branch seems prudent.
          Given that SSDs are becoming common, thinking of storage as only spinning disks may be limiting. Multiple writers may affect each other more negatively on spinning disk vs SSDs. It may be useful to see if the consideration of storage could be abstracted into a plugin, so that storage could have a different resource-allocation policy by storage type (e.g. allocate/share by spindle for spinning-disk storage vs allocate/share by IOPS on SSD storage vs allocate/share by network bandwidth for non-DAS storage). If we can abstract the policy into a plugin on trunk itself, then perhaps we would not need a branch.

          Secondly, it will probably take a long time to agree on what a common policy should be, and the consensus decision will probably not be a good fit for a large percentage of real clusters because of hardware variety. So making this a plugin would enable quicker development, trial, and usage of disk-based allocation compared to arriving at a grand unified allocation model for storage.
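The per-storage-type plugin suggested above could look roughly like this. All class and method names are invented for illustration and are not part of YARN:

```python
from abc import ABC, abstractmethod

class StorageAllocationPolicy(ABC):
    """Hypothetical plugin point: one allocation policy per storage type."""

    @abstractmethod
    def max_containers(self, demand):
        """How many containers with this storage demand fit on the node."""

class SpindleSharePolicy(StorageAllocationPolicy):
    """Spinning disks: share by spindle, since seeks make co-tenancy costly."""

    def __init__(self, spindles, tasks_per_spindle):
        self.vdisks = spindles * tasks_per_spindle

    def max_containers(self, demand):
        return self.vdisks // demand   # demand expressed in vdisks

class IopsPolicy(StorageAllocationPolicy):
    """SSDs: share by an IOPS budget instead of spindles."""

    def __init__(self, total_iops):
        self.total_iops = total_iops

    def max_containers(self, demand):
        return self.total_iops // demand   # demand expressed in IOPS

print(SpindleSharePolicy(spindles=12, tasks_per_spindle=2).max_containers(1))
print(IopsPolicy(total_iops=100_000).max_containers(5000))
```

Each NM would load the policy matching its hardware, so heterogeneous clusters (HDD nodes, SSD nodes, non-DAS nodes) can coexist without a single unified storage model.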

          leftnoteasy Wangda Tan added a comment -

          Thanks Bikas Saha and Karthik Kambatla,

          +1 for working on a branch; there might be a great amount of change across all the major modules, and frequent rebasing might be an issue if this were based on trunk.
          Also, totally agree about having an abstract policy to wrap disk affinity / IOPS / bandwidth, etc.

          kasha Karthik Kambatla added a comment -

          Valid points, Bikas. Wei Yan and I will spend some time and propose a design that would allow plugging in these multiple dimensions.

          stevel@apache.org Steve Loughran added a comment -

          I'd assumed the vspindles you asked for were == SATA HDD spindles, so on an SSD the mapping of multiple vspindles to a physical one would make sense. And if even faster persistent storage/storage interconnect comes out, you'd increase the number.

          sdaingade Swapnil Daingade added a comment -

          +1 for having an abstract policy to wrap spindles / disk affinity / IOPS / bandwidth, etc.

          kasha Karthik Kambatla added a comment -

          A bunch of us (Jian, Karthik, Ram, Vinod, Wei) met up offline last week. Vinod suggested we follow a three-phase approach - (1) avoid overallocation of disk resources, (2) NMs to provide isolation to disallow a container messing with another, (3) add scheduling support for diverse disk IO requirements between containers. We expect little disagreement on the first two items, and understand consensus on the third item might take a little time. So, we want to proceed with getting the patches in for (1) and (2) while we arbitrate on (3).

          The other primary concern in the discussion and other JIRA comments is the ability to view disk resource in other dimensions - bandwidth, iops etc.

          I have updated the design document accordingly:

          1. Added a section on the Scope of the work
          2. Updated the approach to reflect the three main phases involved
          3. Updated design to plug in isolation and scheduling along different dimensions

          The development would be on a branch - YARN-2139, but would like the patches to go through the same review-commit process as they would on trunk.

          bikassaha Bikas Saha added a comment -

          Thanks for the update.
          It's not clear to me how we are going to cleanly de-couple (1) and (2) from (3). At first thought, scheduling is what prevents over-allocation, and the NM enforces the scheduling decision.
          Could you please throw some light on that?

          kasha Karthik Kambatla added a comment -

          The NMs will specify the amount of disk resources on a node, and the RM automatically allocates a fixed amount to each container. For example, each NM could report a vdisks value equal to the number of disks on the node, and each container would get one vdisk. That way, we limit the number of containers running on a node to the number of disks on that node.
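
          As a quick sketch of the arithmetic in this example (all numbers illustrative, not from any actual NM report):

```shell
# Illustrative only: a node reports vdisks equal to its disk count, and the
# RM hands each container a fixed share (1 vdisk here), capping concurrency.
disks=6                        # physical disks on the node
node_vdisks=$disks             # NM reports vdisks == number of disks
vdisks_per_container=1         # fixed allocation per container
echo $((node_vdisks / vdisks_per_container))   # prints 6 (max concurrent containers)
```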

          bikassaha Bikas Saha added a comment -

          Is the concept of vdisk representing a spinning disk or is it going to be some pluggable API?

          kasha Karthik Kambatla added a comment -

          disk-vdisks is one of the ways to represent disk resources; it captures disk shares for weighted sharing of spinning disks / SSDs. In the future, we could add other dimensions like disk-bandwidth, disk-iops, disk-capacity, etc. To specify the dimension(s) to consider for isolation and scheduling, one could set yarn.nodemanager.resource.disk-dimensions and yarn.scheduler.disk-dimensions. The design doc - Disk_IO_Isolation_Scheduling_3.pdf - has more details.
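
          For illustration only, a hypothetical yarn-site.xml fragment using the keys proposed above. These properties were only proposed in this discussion and do not exist in any released YARN; the value format is an assumption:

```xml
<!-- Proposed keys from this discussion; not present in released YARN.
     The value format shown here is an assumption for illustration. -->
<property>
  <name>yarn.nodemanager.resource.disk-dimensions</name>
  <value>disk-vdisks</value>
</property>
<property>
  <name>yarn.scheduler.disk-dimensions</name>
  <value>disk-vdisks</value>
</property>
```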

          bikassaha Bikas Saha added a comment -

          So to be clear, currently vdisks is counting the number of physical drives present on the box.

          Something to keep in mind is whether this also entails a change in the NM policy of providing a directory on every local dir (which typically maps to every disk) to every task. Tasks are free to choose one or more of those dirs (disks) to write to. This puts the spinning disk head under contention and affects performance of all writers on that disk, because seeks are expensive. The rule of thumb tends to be to allocate about as many tasks to a machine as the number of disks (maybe 2x) so as to keep this seek cost low. Should we consider evaluating a change in this policy that gives 1 local dir to a container with 1 vdisk? This way a machine with 6 disks (and 6 vdisks) would have 6 tasks running, each with its own "dedicated" disk. Offhand it's hard to say how this would compare with all 6 disks allocated to all 6 tasks and letting cgroups enforce sharing. If multiple tasks end up choosing the same disk for their writes, they may not get the "allocation" that they thought they would get.

          kasha Karthik Kambatla added a comment -

          currently vdisks is counting the number of physical drives present on the box.

          We see vdisks as a multiple of the number of physical disks on the box. Again, it is just one of the ways, and we can add more ways to share disk resources in the future.

          Should we consider evaluating a change in this policy that gives a container 1 local dir to a container with 1 vdisk. This way for a machine with 6 disks (and 6 vdisks) would have 6 tasks running, each with their own "dedicated" disk.

          Good point. We were thinking of giving the AM the option to choose the amount of disk IO parallelism at the time of launching the container, as part of the spindle locality work. I see AMs wanting to either (1) pick a single local directory for guaranteed performance or (2) stripe accesses across multiple disks for potentially higher throughput based on other work on the node.

          Initially, we could provide a global config for all containers: whether a container's vdisks span the fewest or the most disks.

          sdaingade Swapnil Daingade added a comment -

          Had a look at the latest design doc and was wondering if it would be possible to make the isolation part separate and optional, distinct from the over-allocation-avoidance part. Enforcing isolation using cgroups may not always work, especially in cases where HDFS is not the default dfs.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          YARN-2619 already covered some of the disk isolation work in, well, isolation. It doesn't care about any new concepts like vdisks - all it does is that all containers get 'equal' share of local disk resources.

          kassianojm kassiano josé matteussi added a comment -

          Dear all,

          I have studied resource management for Hadoop applications running inside Linux containers, and I have had trouble restricting disk I/O with cgroups (bps_write, bps_read).

          Does anybody know if it is possible to do so?

          I have heard that limiting I/O with cgroups is restricted to synchronous writes (SYNC), and that is why it wouldn't work well with Hadoop + HDFS. Is this still true in more recent kernel implementations?

          Best regards,
          Kassiano
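
          For reference (not an authoritative answer): on cgroup v1, the blkio controller throttles via files such as blkio.throttle.write_bps_device, which take a "<major>:<minor> <bytes-per-second>" line, and the limit indeed applies only to direct/synchronous I/O, since buffered writeback is not attributed to the cgroup on v1 (the cgroup v2 io controller addresses this). A minimal sketch of building such a rule, with illustrative device numbers:

```shell
# Build a blkio throttle rule for a hypothetical /dev/sda (major 8, minor 0).
# `stat -c '%t:%T' /dev/sda` reports these numbers in hex; the cgroup file
# expects them in decimal.
maj_hex=8
min_hex=0
limit_bps=$((10 * 1024 * 1024))    # cap writes at 10 MB/s
rule="$((16#$maj_hex)):$((16#$min_hex)) $limit_bps"
echo "$rule"                        # prints: 8:0 10485760
# On a real node (as root, cgroup v1), the rule would be applied roughly as:
#   echo "$rule" > /sys/fs/cgroup/blkio/<cgroup>/blkio.throttle.write_bps_device
```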

          He Tianyi He Tianyi added a comment -

          We recently introduced SSDs in our cluster for MapReduce shuffle.
          One issue: if the map output gets too large, it cannot be placed on the SSD. We had to implement a custom strategy (called SSDFirst) that makes a best effort to use the SSD but falls back to HDD when available SSD space gets tight.
          This worked in most cases, but it is only a local optimum. To achieve a global optimum, the scheduler must be aware of and manage these resources.
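
          A strategy along those lines might look roughly like this (a hedged sketch; pick_dir, the directory names, and the threshold are invented for illustration and are not the actual SSDFirst implementation):

```shell
# Hypothetical SSD-first directory selection: prefer the SSD dir while it has
# at least min_free_kb kilobytes free, otherwise fall back to the HDD dir.
pick_dir() {
  ssd_dir=$1; hdd_dir=$2; min_free_kb=$3
  # POSIX df -Pk: second output line, fourth column is available space in KB.
  free_kb=$(df -Pk "$ssd_dir" | awk 'NR==2 {print $4}')
  if [ "$free_kb" -ge "$min_free_kb" ]; then
    echo "$ssd_dir"
  else
    echo "$hdd_dir"
  fi
}

# Example usage (paths illustrative):
#   pick_dir /mnt/ssd/shuffle /mnt/hdd/shuffle 1048576   # require 1 GB free on SSD
```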


            People

            • Assignee: Unassigned
            • Reporter: ywskycn Wei Yan
            • Votes: 4
            • Watchers: 98
