Hadoop Map/Reduce: MAPREDUCE-555

Provide an option to turn off priorities in jobs

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The fairshare scheduler can define pools mapping to queues (as defined in the capacity scheduler - HADOOP-3445). When used in this manner, one can imagine queues set up to be used by users who come from disparate teams or organizations (say a default queue). For such a queue, it makes sense to ignore job priorities and consider the queue as strict FIFO, as it is difficult to compare priorities of jobs from different users.

        Activity

        Matei Zaharia added a comment -

        Right now, we do fair sharing between jobs in the same pool, so the jobs in the default queue would share the excess capacity in the cluster equally. This is what we wanted at Facebook because jobs were submitted to a default queue by multiple users and we wanted short jobs to be able to finish while long jobs are also running. However, it would make a lot of sense to support a "fifo" setting on each pool to make some pools be FIFO. Would that be acceptable?

        Hemanth Yamijala added a comment -

        Matei, the "fifo" setting in the fairshare scheduler is currently priority-based FIFO, meaning jobs are sorted by priority and then by time. I was hoping there would be a way to turn off the priority-based comparison completely and make it pure FIFO.

        I hope the use case made sense. Since users from possibly competing organizations could submit jobs to a default pool, their job priorities cannot be compared apples to apples. Hence, the site administrator might prefer to ask the scheduler to turn off priorities in the comparison entirely and let first come, first served decide.

        So, I was wondering if there could be a 'support-priorities' tag per pool. If this is off, the scheduler does not use job priorities. Note that the scheduling mode could still be either 'fairshare' or 'fifo'; in both cases, priorities would be ignored.

        If you had also meant fifo in your comment above to mean the same, then we are in sync.
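
        To make the two orderings concrete, here is a minimal Java sketch of the comparison being discussed. The Job class, its fields, and the per-pool supportPriorities flag are hypothetical stand-ins for illustration, not the scheduler's actual API.

        // Illustrative sketch only: 'Job', its fields, and 'supportPriorities'
        // are hypothetical, not the real JobInProgress/pool API.
        import java.util.Comparator;

        class Job {
          int priority;    // smaller value = higher priority
          long startTime;  // submission/start time in milliseconds
        }

        class FifoOrderings {
          // Current behaviour: priority-based FIFO (priority first, then start time).
          static final Comparator<Job> PRIORITY_FIFO =
              Comparator.<Job>comparingInt(j -> j.priority)
                        .thenComparingLong(j -> j.startTime);

          // With a per-pool 'support-priorities' tag turned off: pure FIFO.
          static final Comparator<Job> PURE_FIFO =
              Comparator.comparingLong(j -> j.startTime);

          static Comparator<Job> forPool(boolean supportPriorities) {
            return supportPriorities ? PRIORITY_FIFO : PURE_FIFO;
          }
        }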

        Matei Zaharia added a comment -

        Right now the FIFO mode is global and ignores pools altogether - it was meant to be a way to switch back to FIFO at runtime if for some reason the fair sharing is buggy, not as a final feature of the scheduler. This is why it used priorities, to act exactly like the old scheduler. So what I meant is having a setting for each pool to determine whether it will run its jobs in FIFO order or perform fair sharing between them, but always doing sharing between pools. If we do this, then I think having an "ignore priorities" option per pool also works for the default pool use case you mentioned above where the pool is a common resource and it doesn't make sense to let users set priorities.

        Hemanth Yamijala added a comment -

        If I understand correctly, the fairshare scheduler currently orders its jobs globally (irrespective of the pools they belong to) based on the mode - fairshare/deficit-based or FIFO. Are you proposing that we make the mode setting per pool? That would consequently require the algorithm to maintain job lists per pool as well, no?

        Matei Zaharia added a comment -

        Right, this is what I was proposing. For fair sharing mode, the global ordering by deficit works fine because it makes sure that each pool gets its own share. However, for FIFO, it means that the scheduler as a whole runs jobs in FIFO order, regardless of what pool they're in, so things like per-pool minimum shares don't work. I imagine it may be useful to support having multiple pools each of which runs FIFO - this would emulate the capabilities of the capacity scheduler for example. Does this make sense?

        Doing this would currently require tricky logic if we wanted to integrate it with the deficit model to allow FIFO and fair-share pools to coexist. However, if instead we remove the concept of deficits and sort jobs by how far they are below their fair share (or how long they've waited if they are below their min share), I think it should be possible to define a comparator that takes into account both FIFO and fair-sharing pools. This is why I'm proposing to remove the deficit model in HADOOP-4803. It seemed like a good idea initially, especially when the fair scheduler had no preemption, but it does the wrong thing at times (as pointed out in that JIRA) and it complicates the code, making it hard to add features such as this.

        Hemanth Yamijala added a comment -

        Yes, I too felt that with the current design it was going to be complicated to do FIFO ordering per pool while supporting a global ordering by deficit.

        Can you give some more details about how you intend the new design to look? For example, would there be job lists per pool, ordered either by FIFO or by how long the jobs have waited to get their fair share?

        Matei Zaharia added a comment - edited

        Yes, the design would be as follows:

        • Each job belongs to a pool. Pools may be marked as either FIFO or fair sharing.
        • Each pool has a minimum share (guaranteed share) defined in the config. Any excess capacity is divided between pools according to fair sharing, as in the current scheduler.
        • Each pool takes its min share and fair share and divides it among the jobs inside the pool:
          • For a fair sharing pool, we divide the min and fair shares equally among jobs as happens now (well, technically using weights)
          • For a FIFO pool, we give as much of the min share as possible to the first job, give any excess to the second job (if the first job didn't have enough unlaunched tasks to consume the pool's full share), etc until we run out. Same with fair share.
        • Now for the purpose of scheduling, we can have one big list of runnable jobs, each of which has a min share and a fair share. We sort this list first by whether the job is below its min share (breaking ties by how long it's been below this), and then for the remaining jobs by how far each job is below its fair share (as a percent). We then scan through it to pick tasks, using the same wait technique proposed in HADOOP-4667 to skip jobs that don't happen to have local tasks for the current heartbeat.

        On top of this we can have any logic we want for user limits, when to initialize jobs, etc (as we've been talking about in other JIRAs).

        I think this should work without very complicated code, and will be much easier to understand than the current deficit stuff. It also leaves the option open to have pools with scheduling disciplines other than FIFO or fair sharing, since the job of each pool is just to subdivide its own min and fair shares among the jobs within it. This might enable something like HADOOP-5199.
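
        To make the ordering in the last bullet above concrete, here is a rough Java sketch of such a comparator. SchedulableJob and its fields are hypothetical placeholders, not actual fair-scheduler code; a FIFO pool would be modelled purely by how it hands its min and fair shares to its jobs before this comparison runs.

        // Hypothetical sketch of the proposed job ordering, not real scheduler code.
        import java.util.Comparator;

        class SchedulableJob {
          int runningTasks;
          double minShare;          // slots this job is guaranteed via its pool
          double fairShare;         // slots assigned to this job by fair sharing
          long belowMinShareSince;  // when the job last dropped below its min share
        }

        class ProposedJobOrder implements Comparator<SchedulableJob> {
          @Override
          public int compare(SchedulableJob a, SchedulableJob b) {
            boolean aNeedy = a.runningTasks < a.minShare;
            boolean bNeedy = b.runningTasks < b.minShare;
            if (aNeedy != bNeedy) {
              return aNeedy ? -1 : 1;  // jobs below their min share come first
            }
            if (aNeedy) {
              // Both below min share: the one that has been below it longer goes first.
              return Long.compare(a.belowMinShareSince, b.belowMinShareSince);
            }
            // Otherwise, order by how far each job is below its fair share, as a ratio.
            double aRatio = a.runningTasks / Math.max(a.fairShare, 1.0);
            double bRatio = b.runningTasks / Math.max(b.fairShare, 1.0);
            return Double.compare(aRatio, bRatio);
          }
        }

        The scheduler would then scan the sorted list on each heartbeat to pick tasks, skipping jobs that have no local tasks, as described above.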

        Vivek Ratan added a comment -

        Good discussion.

        My comments are more relevant in the context of HADOOP-5199, but I'm adding them here.

        Pools may be marked as either FIFO or fair sharing.

        There is another kind of pool, which is priority-based: jobs are sorted by priority, and jobs with the same priority are sorted by submission time (rather, start time). This is the kind of pool the Capacity Scheduler was meant to handle. So maybe you have two kinds of pools, fairshare and priority (priority pools have an option to ignore priorities, in which case jobs are sorted only by submission order), or maybe you conceptually have three: fairshare, FIFO (sorted only by submission order), and priority (sorted by priority, then by submission order).

        Each pool has a minimum share (guaranteed share) defined in the config. Any excess capacity is divided between pools according to fair sharing, as in the current scheduler.

        I'm wondering if there's a way to simplify this a little more. Suppose pools only have a minimum (guaranteed) share, which you can also think of as a pool's fair share. Any excess capacity is simply claimed by pools that are 'running behind'. So, a scheduler would first pick a pool based on how much the pool is being utilized (which depends on how many tasks the pool is running compared with its minimum/fair share). If pools running below capacity do not have a task to run, we'll eventually pick a pool that's running at/above capacity and give the slot to it. That way, you don't need a notion of two numbers for a pool - minimum and fairshare - and you don't need to explicitly compute excess capacity and redistribute it among pools. Does that make sense? Or do you still feel you need the notion of both a min share and a fair share?

        Now for the purpose of scheduling, we can have one big list of runnable jobs, each of which has a min share and a fair share.

        Rather than having a global list of jobs, how about each pool just maintains its own list of jobs in the sorted order you mention? The scheduler first picks a pool as I detailed in my previous point, then you can pick a job based on the ordering you've mentioned. Sorting all jobs in the cluster is expensive. Sorting jobs in a single pool will be cheaper because there are fewer jobs.
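
        A small sketch of that two-level pick, reusing the hypothetical SchedulableJob from the earlier sketch; Pool, its fields, and pickJob are likewise illustrative names only.

        // Illustrative only: pick the least-utilized pool first, then the head of
        // that pool's own sorted job list.
        import java.util.Comparator;
        import java.util.List;
        import java.util.Optional;

        class Pool {
          double minShare;                  // the pool's guaranteed (fair) share
          int runningTasks;
          List<SchedulableJob> sortedJobs;  // maintained per pool, in scheduling order

          double utilization() {
            return runningTasks / Math.max(minShare, 1.0);
          }
        }

        class TwoLevelPick {
          static Optional<SchedulableJob> pickJob(List<Pool> pools) {
            // Pools running furthest below their share are considered first; if they
            // have no runnable jobs, we fall through to pools at or above capacity.
            return pools.stream()
                .sorted(Comparator.comparingDouble(Pool::utilization))
                .flatMap(p -> p.sortedJobs.stream())
                .findFirst();
          }
        }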

        Vinod Kumar Vavilapalli added a comment -

        Regarding distribution of excess capacity: it looks like a fairly common thing to do and could be made pluggable - to distribute it evenly across pools, according to pool weights, or by how much pools are running behind, etc.

        It looks to me like the design is moving more and more towards the objectives of HADOOP-5199. As far as the fair scheduler is concerned, this seems like a big design change (though perhaps not that big a code change). So, shouldn't we do all this as part of HADOOP-5199 and do a quick fix here to disable priorities in the current fair scheduler?

        My 2 cents. Thoughts?
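
        As a sketch of what the pluggable excess-capacity policy suggested above might look like (the interface name and signature are purely illustrative, reusing the hypothetical Pool class from the sketch above, and not an existing Hadoop API):

        import java.util.List;
        import java.util.Map;

        // Hypothetical extension point, not an existing Hadoop interface.
        interface ExcessCapacityPolicy {
          // Split 'excessSlots' among the pools, e.g. evenly, by pool weight,
          // or by how far behind each pool is running; returns extra slots per pool.
          Map<Pool, Integer> distribute(int excessSlots, List<Pool> pools);
        }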

        Yiping Han added a comment -

        Regarding distribution of excess capacity: should we have a priority mode between queues? Say, queues are given priorities, so when there is excess capacity, it is given to the queues with the highest priority first.

        I agree with Vinod that this should be made pluggable.

        Allen Wittenauer added a comment -

        This was done elsewhere by a flag in the schedulers themselves. Closing as fixed.


          People

          • Assignee: Unassigned
          • Reporter: Hemanth Yamijala
          • Votes: 0
          • Watchers: 4
