Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None

      Description

      observing the cluster over the last day - one thing i noticed is that small jobs (single-digit tasks) are not doing a good job of competing against large jobs. what seems to happen is:

    • a large job comes along and needs to wait for a while behind other large jobs.
    • slots are slowly transferred from one large job to another.
    • the small jobs keep waiting forever.

      is this an artifact of deficit-based scheduling? it seems that long-pending large jobs are out-scheduling small jobs.

        Activity

        Joydeep Sen Sarma created issue -
        Joydeep Sen Sarma added a comment -

        it's worse than the title would suggest. compute quotas are also not being honored.

        i am observing our ETL pipeline not getting its configured number of map slots - whereas there's another job with no minimum slot guarantee that keeps hogging resources. this other job has a crazy-looking deficit (from the advanced scheduler page):

        Deficit: -17796533s

        but regardless - i would have thought that honoring quotas would come before any form of fair scheduling (deficit-based or not).
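
        To make the expectation concrete - guarantees first, fairness after - here is a minimal toy allocator in Java. It is only a sketch of the principle; the Pool class, its fields, and the two-phase loop are hypothetical and do not correspond to the actual fair scheduler code.

import java.util.ArrayList;
import java.util.List;

public class MinShareFirstAllocator {

    static class Pool {
        final String name;
        final int minShare;   // configured slot guarantee (the "quota")
        final int demand;     // slots the pool's runnable tasks could use
        int allocated = 0;

        Pool(String name, int minShare, int demand) {
            this.name = name;
            this.minShare = minShare;
            this.demand = demand;
        }
    }

    // Phase 1: satisfy each pool's min share; phase 2: hand out what is left,
    // one slot at a time, to the pool that currently holds the fewest slots
    // (a crude stand-in for fair sharing).
    static void allocate(List<Pool> pools, int totalSlots) {
        int remaining = totalSlots;
        for (Pool p : pools) {
            int give = Math.min(Math.min(p.minShare, p.demand), remaining);
            p.allocated += give;
            remaining -= give;
        }
        while (remaining > 0) {
            Pool target = null;
            for (Pool p : pools) {
                if (p.allocated < p.demand && (target == null || p.allocated < target.allocated)) {
                    target = p;
                }
            }
            if (target == null) break;   // nobody wants more slots
            target.allocated++;
            remaining--;
        }
    }

    public static void main(String[] args) {
        List<Pool> pools = new ArrayList<>();
        pools.add(new Pool("etl", 40, 100));    // pipeline with a configured guarantee
        pools.add(new Pool("adhoc", 0, 100));   // job with no minimum slot guarantee
        allocate(pools, 60);
        for (Pool p : pools) {
            System.out.println(p.name + " -> " + p.allocated + " slots");
        }
        // prints: etl -> 40 slots, adhoc -> 20 slots
    }
}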

        Joydeep Sen Sarma made changes -
        Assignee: (none) → Matei Zaharia [ matei@eecs.berkeley.edu ]
        Joydeep Sen Sarma added a comment -

        Ignore the previous comment (which may be due to a bug we are trying to diagnose separately). i would like to stick to the initial report - deficit-based scheduling effectively causes long-pending jobs to rise in priority. especially when such jobs are large, they tend to hog the cluster once they can be scheduled.

        i guess this does beg the question of why a large deficit was accumulated in the first place (and that may be due to a bug) - but this does seem to call for some solution in any case.

        one of the things that i think you had mentioned would make sense - instead of giving all incoming slots to a job with a large deficit, give it a large enough fraction that it's 'catching up'. one way could be to assign an additional weight multiplier that's proportional to the deficit. this should leave some slots, on an ongoing basis, for new jobs without much deficit. thoughts?
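
        A rough sketch of the weight-multiplier idea follows, in Java. Everything here is illustrative - JobInfo, the 600-second scale factor, and the proportional-share arithmetic are made up for the example and are not the fair scheduler's actual code or API.

import java.util.List;

public class DeficitWeightedShares {

    static class JobInfo {
        final String name;
        final double baseWeight;    // normal fair-share weight
        final double deficitSecs;   // accumulated deficit, in seconds

        JobInfo(String name, double baseWeight, double deficitSecs) {
            this.name = name;
            this.baseWeight = baseWeight;
            this.deficitSecs = deficitSecs;
        }

        // Boost the weight in proportion to the deficit, so a backlogged job
        // catches up faster without taking every freed slot. The 600s scale
        // factor is arbitrary.
        double effectiveWeight() {
            return baseWeight * (1.0 + Math.max(0.0, deficitSecs) / 600.0);
        }
    }

    // Split the cluster's slots in proportion to effective weight.
    // (Rounded shares may not sum exactly to totalSlots.)
    static void printShares(List<JobInfo> jobs, int totalSlots) {
        double totalWeight = jobs.stream().mapToDouble(JobInfo::effectiveWeight).sum();
        for (JobInfo j : jobs) {
            long share = Math.round(totalSlots * j.effectiveWeight() / totalWeight);
            System.out.println(j.name + " -> " + share + " slots");
        }
    }

    public static void main(String[] args) {
        printShares(List.of(
                new JobInfo("big-backlogged-job", 1.0, 3600),   // one hour of deficit
                new JobInfo("small-new-job", 1.0, 0)),          // no deficit yet
                100);
        // the backlogged job gets most of the slots, but the new job still gets some
    }
}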

        Matei Zaharia added a comment -

        I agree, I think the "catching up" idea could help here. Basically the problem is the following: if a job with long tasks makes it to the head of the queue (max deficit), it may grab a lot of slots and hold onto them for a while. Instead, we should give it a more moderate share - enough that it can "catch up" within a reasonable time. It may also be good to take the job's task durations into account in this equation - i.e., look further ahead for jobs with large tasks.
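
        As a worked example of what a "moderate share" might look like, a small Java sketch follows. It assumes the deficit is measured in slot-seconds and that we plan over at least one average task length for jobs with long tasks; the function, its parameters, and the formula are assumptions for illustration, not the scheduler's real logic.

public class CatchUpShare {

    // Extra slots (above fair share) needed to erase deficitSlotSeconds within
    // catchUpWindowSecs. With long tasks we plan over at least one task length,
    // since slots grabbed now cannot be released before a task finishes.
    static int extraSlots(double deficitSlotSeconds,
                          double catchUpWindowSecs,
                          double avgTaskDurationSecs,
                          int runnableTasks) {
        double horizon = Math.max(catchUpWindowSecs, avgTaskDurationSecs);
        int extra = (int) Math.ceil(Math.max(0.0, deficitSlotSeconds) / horizon);
        return Math.min(extra, runnableTasks);   // never more than the job can use
    }

    public static void main(String[] args) {
        // 36,000 slot-seconds behind, 10-minute catch-up window, 30-minute tasks:
        // plan over 1800s, so roughly 20 extra slots rather than the whole cluster.
        System.out.println(extraSlots(36_000, 600, 1800, 500));   // prints 20
    }
}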

        Matei Zaharia added a comment -

        In fact, upon further discussion with Joydeep and Dhruba, we may drop deficits altogether once we add preemption, and just use a similar concept for guaranteed shares to make sure pools get their min share in order of how long they've been waiting for it. This will simplify the code and make the scheduler's behavior easier to understand.
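
        A minimal sketch of that ordering in Java - no per-job deficits, just pools sorted by how long they have been below their min share. The Pool class and the lastTimeAtMinShare timestamp are hypothetical stand-ins for whatever the preemption work actually tracks.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MinShareWaitOrdering {

    static class Pool {
        final String name;
        final long lastTimeAtMinShare;   // millis; refreshed whenever running >= min share

        Pool(String name, long lastTimeAtMinShare) {
            this.name = name;
            this.lastTimeAtMinShare = lastTimeAtMinShare;
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<Pool> pools = new ArrayList<>();
        pools.add(new Pool("adhoc", now - 5_000));       // barely waiting
        pools.add(new Pool("etl", now - 600_000));       // starved for 10 minutes
        pools.add(new Pool("reports", now - 60_000));

        // The pool that has waited longest (smallest timestamp) is offered slots first.
        pools.sort(Comparator.comparingLong((Pool p) -> p.lastTimeAtMinShare));

        for (Pool p : pools) {
            long waitedSecs = (now - p.lastTimeAtMinShare) / 1000;
            System.out.println(p.name + " has been below min share for " + waitedSecs + "s");
        }
    }
}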

        Hemanth Yamijala added a comment -

        So, do you mean that you won't maintain job deficits at all, and instead you would track how long pools (or jobs) have gone without their min share, and sort by that?

        Matei Zaharia added a comment -

        Yes, exactly. The last time each pool was at its min / fair share is already being maintained by the preemption patch (HADOOP-4665), so it won't be much work. One other benefit of this change is that jobs will tend to reuse the same slot more often, leading to more JVM reuse. This can be a bad thing if it leads to poor locality, but HADOOP-4667 will ensure that a job keeps using a node until it runs out of local blocks to read on that node, and then waits and switches, hopefully to a node where it has more local data to process. This should give us the best of both JVM reuse and data locality. (When I talked to Arun and Owen about the use of deficits in the fair scheduler before, they were concerned that it might lead to less JVM reuse because jobs will jump between slots more often.)
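
        To illustrate the node-locality wait described here, a small Java sketch follows. It only captures the general idea (wait a bounded time for a node-local slot before accepting a non-local one); the class, field names, and the 5-second cap are assumptions, not HADOOP-4667's actual implementation.

import java.util.Set;

public class LocalityWaitSketch {

    static final long MAX_WAIT_MS = 5_000;   // arbitrary bound on how long to hold out

    static class Job {
        final Set<String> nodesWithLocalTasks;   // nodes that still hold unprocessed local blocks
        long waitingSinceMs = -1;                // -1 means not currently waiting

        Job(Set<String> nodesWithLocalTasks) {
            this.nodesWithLocalTasks = nodesWithLocalTasks;
        }
    }

    // Decide what to do when `node` heartbeats with a free slot, preferring locality.
    static String assign(Job job, String node, long nowMs) {
        if (job.nodesWithLocalTasks.contains(node)) {
            job.waitingSinceMs = -1;               // got a local slot; reset the wait
            return "launch node-local task on " + node;
        }
        if (job.waitingSinceMs < 0) {
            job.waitingSinceMs = nowMs;            // start waiting for a better node
        }
        if (nowMs - job.waitingSinceMs < MAX_WAIT_MS) {
            return "skip " + node + ", keep waiting for a local slot";
        }
        return "launch non-local task on " + node; // waited long enough; give up locality
    }

    public static void main(String[] args) {
        Job job = new Job(Set.of("nodeA", "nodeB"));
        System.out.println(assign(job, "nodeA", 0));        // node-local
        System.out.println(assign(job, "nodeC", 1_000));     // skip and wait
        System.out.println(assign(job, "nodeC", 7_000));     // fall back to non-local
    }
}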

        Matei Zaharia added a comment -

        A note on progress for this issue: I have a tested patch that removes deficits and also adds support for FIFO pools, but I am waiting for HADOOP-4665 and HADOOP-4667 to be committed before posting it because it depends on those.

        Owen O'Malley made changes -
        Project: Hadoop Common [ 12310240 ] → Hadoop Map/Reduce [ 12310941 ]
        Key: HADOOP-4803 → MAPREDUCE-543
        Component/s: (none) → contrib/fair-share [ 12312456 ]
        Todd Lipcon added a comment -

        This was incorporated into MAPREDUCE-706, right? Can we link that and close as dup?

        Matei Zaharia added a comment -

        Yes indeed, this issue was fixed as part of MAPREDUCE-706, so I'm closing it as a duplicate.

        Matei Zaharia made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Fix Version/s: (none) → 0.21.0 [ 12314045 ]
        Resolution: (none) → Duplicate [ 3 ]
        Matei Zaharia made changes -
        Assignee: Matei Zaharia [ matei@eecs.berkeley.edu ] → Matei Zaharia [ matei ]
        Tom White made changes -
        Status: Resolved [ 5 ] → Closed [ 6 ]

        Transition | Time in source status | Execution times | Last executer | Last execution date
        Open → Resolved | 336d 14h 47m | 1 | Matei Zaharia | 10/Nov/09 08:02
        Resolved → Closed | 287d 13h 11m | 1 | Tom White | 24/Aug/10 22:13

          People

          • Assignee: Matei Zaharia
          • Reporter: Joydeep Sen Sarma
          • Votes: 0
          • Watchers: 10

            Dates

            • Created:
            • Updated:
            • Resolved:
