Hadoop Map/Reduce: MAPREDUCE-554

Improve limit handling in fairshare scheduler

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The fairshare scheduler can limit the number of running jobs in a pool via the maxRunningJobs parameter in its allocations definition. This limit is treated as a hard limit and comes into effect even if the cluster is free to run more jobs, resulting in underutilization. The same thing possibly happens with the per-user maxRunningJobs parameter and with userMaxJobsDefault. It may help to treat these as soft limits and run additional jobs to keep the cluster fully utilized.
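
      For reference, these limits live in the fair scheduler's allocations file. The following is only a minimal sketch of where the parameters above sit (pool name, user name, and values are made up for illustration):

          <?xml version="1.0"?>
          <allocations>
            <!-- Cap on concurrently running jobs in this pool -->
            <pool name="reports">
              <maxRunningJobs>5</maxRunningJobs>
            </pool>
            <!-- Cap for a specific user -->
            <user name="alice">
              <maxRunningJobs>3</maxRunningJobs>
            </user>
            <!-- Default cap for users without an explicit entry -->
            <userMaxJobsDefault>2</userMaxJobsDefault>
          </allocations>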

        Activity

        Hemanth Yamijala added a comment -

        Another point to consider (though I don't know if it belongs in the same JIRA) is that currently the scheduler initializes all jobs submitted to the cluster immediately. Initialized jobs add to the memory footprint of the JobTracker. This could hurt JobTracker scalability and performance with respect to the number of jobs on very busy clusters.

        If limits are set, maybe we can initialize only as many jobs as the limit allows, plus a few more from the queue, so that the memory footprint is kept low. This may help the JT scale better.

        Matei Zaharia added a comment -

        Both suggestions make a lot of sense. So, to clarify, would the logic look like this?

        • For each pool, initialize and schedule at least maxRunningJobs jobs.
        • If all tasks from all initialized jobs in the pool are running, then initialize further jobs (see the sketch below).
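
        A sketch of that logic in terms of simple per-pool counts (names and structure here are illustrative, not the actual FairScheduler API) could be:

            // Sketch only: how many more jobs a pool should initialize.
            static int jobsToInitialize(int maxRunningJobs,
                                        int initializedJobs,
                                        boolean allInitializedTasksRunning) {
              // Rule 1: keep at least maxRunningJobs jobs initialized and schedulable.
              if (initializedJobs < maxRunningJobs) {
                return maxRunningJobs - initializedJobs;
              }
              // Rule 2: if every task of every initialized job is already running,
              // the pool has no spare work, so initialize further jobs.
              return allInitializedTasksRunning ? 1 : 0;
            }
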
        Hemanth Yamijala added a comment -

        If all tasks from all initialized jobs in the pool are running, then initialize further jobs.

        Matei, waiting until all the jobs become running before initializing further jobs may still result in some underutilization, if we only fire initialization once we determine that all previous jobs are running. This is because job initialization is expected to take some time, as it involves DFS access for localizing the job, as well as running the 'Setup task' (which is handled transparently by the JobTracker). Hence, maybe it is a better idea to pre-initialize a few additional jobs (the number could be very small) and keep that number constant. This way we are still bounded, but also have a backlog of jobs to immediately schedule tasks from if all other jobs become running by then. Does this make sense?
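
        In other words (a sketch with an illustrative constant, not actual scheduler code), the number of initialized jobs per pool would be bounded as:

            // Sketch: keep a small, constant backlog of pre-initialized jobs so the
            // scheduler never waits on initialization (DFS access, setup task)
            // before it can hand out new work. PRE_INIT_BACKLOG is illustrative.
            static final int PRE_INIT_BACKLOG = 2;

            static int targetInitializedJobs(int maxRunningJobs, int totalJobsInPool) {
              // Still bounded: never more than the limit plus the fixed backlog,
              // and never more than the jobs actually submitted to the pool.
              return Math.min(maxRunningJobs + PRE_INIT_BACKLOG, totalJobsInPool);
            }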

        Matei Zaharia added a comment -

        That sounds good. Maybe we'll do it only when (number of unlaunched tasks) < (number of nodes) or perhaps number of task slots.
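
        As a condition (sketch only; the counts would come from the cluster status), that would be something like:

            // Sketch: only pre-initialize more jobs when the already-initialized
            // jobs are close to running out of tasks for the cluster to execute.
            static boolean shouldInitializeMore(int unlaunchedTasks, int totalTaskSlots) {
              return unlaunchedTasks < totalTaskSlots;
            }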

        Hemanth Yamijala added a comment -

        Hmm. I am wondering if in that case the unlaunched tasks could get scheduled faster than job initialization completes, since initialization could potentially take a long time depending on the user's setup code, the DFS load, etc. It may just be simpler to have an additional 2 or 3 jobs pre-initialized. I agree it is less optimal than your approach, but it seems simpler to reason about.

        Hemanth Yamijala added a comment -

        BTW, the capacity scheduler implemented a similar algorithm for lazy initialization, while balancing the utilization needs of the cluster. If you would like, please take a look at src/contrib/capacity-scheduler/src/o.a.h.m.JobInitializationPoller.java for details.
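
        The general shape of such a poller (a loose sketch inspired by that idea, not the capacity scheduler's actual code; the Scheduler interface and interval are illustrative) is a background thread that periodically tops up the set of initialized jobs:

            // Loose sketch of a lazy-initialization poller thread.
            class JobInitializationPollerSketch extends Thread {
              interface Scheduler {
                int jobsNeedingInitialization();  // how many more jobs to initialize
                void initializeNextJob();         // pull one job off the waiting queue
              }

              private static final long SLEEP_MS = 3000;  // illustrative interval
              private final Scheduler scheduler;
              private volatile boolean running = true;

              JobInitializationPollerSketch(Scheduler scheduler) {
                this.scheduler = scheduler;
              }

              public void run() {
                while (running) {
                  // Initialize jobs off the scheduling path, since initialization
                  // involves DFS access and running the setup task.
                  int needed = scheduler.jobsNeedingInitialization();
                  for (int i = 0; i < needed; i++) {
                    scheduler.initializeNextJob();
                  }
                  try {
                    Thread.sleep(SLEEP_MS);
                  } catch (InterruptedException ie) {
                    running = false;
                  }
                }
              }
            }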

        Vinod Kumar Vavilapalli added a comment -

        One more issue with the current model of limiting by number of jobs per pool is the underutilization of map slots while reduce tasks are running. If N is the maximum number of jobs that can run in a particular pool, then once all the maps of the first N jobs are done and only their reduces are running, the map slots sit idle and the maps of the next batch of N jobs cannot yet be scheduled. The fix for this is to change the limits to be in terms of tasks/slots instead of the number of jobs.
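
        Expressed as checks (illustrative names only), task-based limits would gate each task type separately, so idle map slots can go to later jobs even while earlier jobs are still running reduces:

            // Sketch: per-pool limits in terms of running tasks rather than jobs.
            static boolean canLaunchMapTask(int runningMapTasks, int poolMaxMapTasks) {
              return runningMapTasks < poolMaxMapTasks;
            }

            static boolean canLaunchReduceTask(int runningReduceTasks, int poolMaxReduceTasks) {
              return runningReduceTasks < poolMaxReduceTasks;
            }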

        Matei Zaharia added a comment -

        A note on progress for this issue: I have a tested patch that removes deficits and also adds support for FIFO pools (but not yet lazy job init as discussed here). However, I am waiting for HADOOP-4665 and HADOOP-4667 to be committed before posting it because it depends on those.


          People

          • Assignee: Unassigned
          • Reporter: Hemanth Yamijala
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
            • Updated:
