Hadoop Map/Reduce / MAPREDUCE-2441

regression: maximum limit of -1 + user-limit math appears to be off

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 0.20.203.0
    • Fix Version/s: None
    • Component/s: capacity-sched
    • Labels: None

      Description

      The math around the slot usage when maximum-capacity=-1 appears to be faulty. See comments.

      1. capsched.xml
        1 kB
        Allen Wittenauer

        Activity

        Allen Wittenauer added a comment -

        This is our capacity scheduler configuration. On a test grid with 762 map slots, the first user running in the default queue only got 266 map slots. This doesn't appear to be either the user limit or the max limit.

        Allen Wittenauer added a comment -

        or the queue limit.

        So where does the 266 come from? The job was a terasort job with 1000 map tasks.

        Allen Wittenauer added a comment -

        Actually, it looks like queue spillage/task stealing doesn't work at all, whether it is -1 or not. The problem code appears to be in assignSlotsToJob which appears to have replaced the two-phase system in previous versions with a single phase. This single phase does this check to determine the limit:

        int limit =
              Math.min(
                  Math.max(divideAndCeil(currentCapacity, activeUsers),
                           divideAndCeil(ulMin*currentCapacity, 100)),
                  (int)(queueCapacity * ulMinFactor)
                  );

        In a two-queue system where one queue has maximum-capacity=-1 and the other has a numeric value, the maximum queue capacity ends up being set to the remainder. Without a second pass, any additional slots from other queues are essentially ignored.
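
        To make the formula concrete, here is a minimal standalone sketch that evaluates the same expression outside the scheduler. The class name, the method signature, and every input value are assumptions chosen for illustration; the real values come from capsched.xml and the scheduler's runtime state, neither of which is reproduced here. A queue capacity of 266 slots is assumed only because it matches the observed number, and a currentCapacity of 400 is assumed to stand in for a queue that wants more than its configured share.

```java
// Sketch only: names and inputs are hypothetical, not taken from the Hadoop source.
final class UserLimitSketch {

    // Integer division rounded up, mirroring the divideAndCeil helper quoted above.
    static int divideAndCeil(int a, int b) {
        if (b == 0) {
            return 0; // assumption: treat "no active users" as zero instead of throwing
        }
        return (a + (b - 1)) / b;
    }

    // Same expression as the quoted snippet, with its inputs passed in explicitly.
    static int userLimit(int currentCapacity, int activeUsers, int ulMin,
                         int queueCapacity, float ulMinFactor) {
        return Math.min(
            Math.max(divideAndCeil(currentCapacity, activeUsers),
                     divideAndCeil(ulMin * currentCapacity, 100)),
            (int) (queueCapacity * ulMinFactor));
    }

    public static void main(String[] args) {
        int queueCapacity = 266;   // assumed configured share, in slots
        int currentCapacity = 400; // assumed demand above the configured share
        int ulMin = 100;           // assumed minimum user limit percent

        // Factor 1.0: the second argument of Math.min wins, so the user is held to 266.
        System.out.println(userLimit(currentCapacity, 1, ulMin, queueCapacity, 1.0f)); // 266
        // Factor 2.0: the cap rises to 532, so the user can reach the full 400.
        System.out.println(userLimit(currentCapacity, 1, ulMin, queueCapacity, 2.0f)); // 400
    }
}
```

        Under these assumed inputs, the outer Math.min is what holds a single user to the queue's configured size, regardless of how many idle slots exist elsewhere.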

        Allen Wittenauer added a comment -

        Changing this from a blocker, since no one but me apparently cares that capacity scheduler doesn't actually work as advertised.

        Allen Wittenauer added a comment -

        Actually, let me correct myself. Task stealing does work, but in a sort of weird and unpredictable way. Basically, an individual user is limited to the "natural" size of the queue they submitted to. So if two users are in the same queue, that queue can steal up to 2x the queue size, etc.
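
        The arithmetic behind that observation can be sketched as follows. This is a deliberate simplification, not the scheduler's logic: it assumes each active user is capped at the queue's natural size, that the queue has no maximum-capacity limit, and that no other queue is competing for slots. The class and method names are hypothetical, and the 266-slot natural size is assumed from the number observed on the 762-slot test grid.

```java
// Sketch only: a simplified model of the behaviour described above.
final class QueueSpillSketch {

    // Upper bound on slots a queue could hold if each active user is capped
    // at perUserCap slots and the only other limit is the size of the cluster.
    static int queueCeiling(int activeUsers, int perUserCap, int clusterSlots) {
        long uncapped = (long) activeUsers * perUserCap; // long avoids int overflow
        return (int) Math.min(uncapped, (long) clusterSlots);
    }

    public static void main(String[] args) {
        int clusterSlots = 762; // map slots on the test grid
        int perUserCap = 266;   // assumed natural size of the queue, in slots

        System.out.println(queueCeiling(1, perUserCap, clusterSlots)); // 266
        System.out.println(queueCeiling(2, perUserCap, clusterSlots)); // 532
        System.out.println(queueCeiling(3, perUserCap, clusterSlots)); // 762
    }
}
```

        In this model, how far a queue can expand depends on how many distinct users are submitting to it, which is why the behaviour looks unpredictable from a single user's point of view.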

        Arun C Murthy added a comment -

        Allen, I'm sorry I missed this ticket.

        As we briefly discussed over IM previously, the CS in 0.20.203 is designed not to allow a single user to go over the natural limit of the queue. As described in the docs, you'll need to set the user-limit-factor for the queue to allow a user to go over... I'm pretty sure I told you in person.

        Allen Wittenauer added a comment -

        Nope, not about user-limit-factor. But doesn't this mean that the first jobs in an expanding queue can starve out jobs in another queue? In other words, if I have:

        job1 = max-lim -1 queue
        job2 = max-lim -1 queue
        job3 = max-lim % queue

        job1 and job2 could take all slots before job3 gets executed, especially if they are submitted by the same user and that is the only user in the job submission queue.


          People

          • Assignee: Unassigned
          • Reporter: Allen Wittenauer
          • Votes: 0
          • Watchers: 5
