Hadoop YARN / YARN-2959

Fair Scheduler "fifo" option can violate FIFO behavior and cause deadlock among jobs


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fairscheduler
    • Labels: None

    Description

      We have a cluster that runs jobs in FIFO order (due to the nature of those jobs) using the Fair Scheduler's "fifo" option.
      Recently we found jobs deadlocked in the cluster. Here is what happened:
      There were two jobs, say A and B. A was submitted before B.
      Both were in the PENDING state because the cluster was busy.
      When containers freed up, the two pending jobs got their AM containers at about the same time.
      However, Job B's AM (appattempt1) registered with the RM slightly earlier than Job A's, grabbed the containers available at that time, and satisfied a fraction of its requirement. Note that Job B cannot make progress until its entire requirement is satisfied.
      Next, Job A's appattempt1 registered with the RM. Because Job A was submitted earlier, the RM stopped allocating containers to Job B and started allocating to Job A, satisfying a fraction of its requirement as well.
      Now Job A and Job B together hold the entire cluster, but neither can make progress: they are deadlocked because their resource requests are only partially satisfied.

      Note: The above is an example with two jobs. However, the deadlock can happen with n jobs J1..Jn if the order of AM registration is Jn, J(n-1), .., J1.
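
      The sequence above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Fair Scheduler code: the Job class, the tick-based loop, the capacity of 10 containers, the grant of 3 containers per tick, and the demand of 8 containers per job are all hypothetical values chosen for illustration. The model always serves the registered job with the earliest submit time, which is the behavior described in this report.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of the scenario in this report; not actual Fair Scheduler code. */
class FifoDeadlockSketch {

    /** Hypothetical job record with an all-or-nothing container demand. */
    static final class Job {
        final String name;
        final long submitTime;    // when the app was submitted
        final long registerTime;  // when its AM (appattempt) registered with the RM
        final int demand;         // containers needed before the job can progress
        int held = 0;             // containers currently allocated

        Job(String name, long submitTime, long registerTime, int demand) {
            this.name = name;
            this.submitTime = submitTime;
            this.registerTime = registerTime;
            this.demand = demand;
        }

        boolean canProgress() {
            return held >= demand;
        }
    }

    /**
     * Hands out up to perTick containers per tick to the registered,
     * unsatisfied job with the earliest submit time.
     * Returns the number of containers still free at the end.
     */
    static int allocateBySubmitTime(List<Job> jobs, int capacity, int perTick) {
        if (perTick <= 0) {
            throw new IllegalArgumentException("perTick must be positive");
        }
        Comparator<Job> bySubmitTime = Comparator.comparingLong((Job j) -> j.submitTime);
        int free = capacity;
        for (long tick = 1; free > 0; tick++) {
            final long now = tick;
            // Only jobs whose AM has registered can be given containers.
            Job head = jobs.stream()
                    .filter(j -> j.registerTime <= now && j.held < j.demand)
                    .min(bySubmitTime)
                    .orElse(null);
            if (head == null) {
                boolean someoneStillToRegister =
                        jobs.stream().anyMatch(j -> j.registerTime > now);
                if (!someoneStillToRegister) {
                    break; // every registered job is fully satisfied
                }
                continue;  // wait for the next AM to register
            }
            int grant = Math.min(perTick, Math.min(free, head.demand - head.held));
            head.held += grant;
            free -= grant;
        }
        return free;
    }

    /** Deadlock in this model: the cluster is full and no job has its full demand. */
    static boolean deadlocked(List<Job> jobs, int free) {
        return free == 0 && jobs.stream().noneMatch(Job::canProgress);
    }

    public static void main(String[] args) {
        // A is submitted first (t=1) but its AM registers second (t=2); B is the reverse.
        List<Job> jobs = Arrays.asList(new Job("A", 1, 2, 8), new Job("B", 2, 1, 8));
        int free = allocateBySubmitTime(jobs, 10, 3);
        for (Job j : jobs) {
            System.out.println(j.name + " holds " + j.held + " of " + j.demand);
        }
        System.out.println("deadlocked = " + deadlocked(jobs, free));
    }
}
```

      With these assumed numbers, B receives 3 containers before A registers, A then receives the remaining 7, and neither reaches its demand of 8, so the model reports a deadlock.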

      Solution: One proposed solution is to order the fifo queue by appattempt start/register time instead of app submit time.
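
      To illustrate the difference between the two orderings, here is a minimal sketch. The App class and its submitTime and registerTime fields are hypothetical names, not the Fair Scheduler's actual types or comparator. It records which app sits at the head of the queue after each AM registration. Under register-time ordering, an app that registers later always sorts behind the apps already in the queue, so the head cannot change midway through its allocation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Compares two candidate FIFO orderings; hypothetical types, not Fair Scheduler code. */
class FifoOrderingSketch {

    /** Hypothetical app record carrying both timestamps discussed in this report. */
    static final class App {
        final String name;
        final long submitTime;    // when the app was submitted
        final long registerTime;  // when its appattempt registered with the RM

        App(String name, long submitTime, long registerTime) {
            this.name = name;
            this.submitTime = submitTime;
            this.registerTime = registerTime;
        }
    }

    // Ordering described as the current behavior in this report.
    static final Comparator<App> BY_SUBMIT_TIME =
            Comparator.comparingLong((App a) -> a.submitTime);

    // Ordering proposed in this report.
    static final Comparator<App> BY_REGISTER_TIME =
            Comparator.comparingLong((App a) -> a.registerTime);

    /**
     * Replays registrations in time order and returns the name of the app
     * at the head of the queue after each one, under the given ordering.
     */
    static List<String> headAfterEachRegistration(List<App> apps, Comparator<App> order) {
        List<App> byRegistration = new ArrayList<>(apps);
        byRegistration.sort(BY_REGISTER_TIME);
        List<App> registered = new ArrayList<>();
        List<String> heads = new ArrayList<>();
        for (App app : byRegistration) {
            registered.add(app);
            heads.add(registered.stream().min(order).get().name);
        }
        return heads;
    }

    public static void main(String[] args) {
        // A is submitted first but registers second; B is the reverse.
        List<App> apps = Arrays.asList(new App("A", 1, 2), new App("B", 2, 1));
        System.out.println("by submit time:   " + headAfterEachRegistration(apps, BY_SUBMIT_TIME));
        System.out.println("by register time: " + headAfterEachRegistration(apps, BY_REGISTER_TIME));
    }
}
```

      For jobs A and B from the description, the submit-time ordering yields heads [B, A] (the head switches once A registers), while the register-time ordering yields [B, B].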

          People

            Assignee: Unassigned
            Reporter: Ashwin Shankar (ashwinshankar77)
            Votes: 0
            Watchers: 4
