Apache Tez / TEZ-915

TaskScheduler can hang when all headroom is used and it cannot utilize the new containers it already holds


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.4.0
    • None
    • None

    Description

      If there are pending unmatched requests and the scheduler holds reused containers, those containers are released to make room for new allocations that may match.

      However, if there are pending unmatched requests and the scheduler holds new containers, those containers are not released, and the scheduler may hang.

      One scenario where this can happen: we get a pri4 container for a pri4 request. Before we match it, we also get a pri1 request (say, for re-execution of a failed task). The pri1 task is now the highest priority, and we always schedule the highest priority first. However, it may not match the container we hold. If there is no headroom, the RM will not give us a new pri1 container, and we will hang.

      The above case needs to be handled in the preemption logic. When we release the pri4 container, we also need to make a new request for that resource. Otherwise the RM will not give it back to us after it has allocated the pri1 container, because the RM currently thinks it has already satisfied our initial pri4 request.
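
The scenario and the proposed handling can be sketched with a toy model. This is only an illustration under stated assumptions, not the Tez TaskScheduler or the YARN client API: `PreemptionSketch`, `ToyRM`, `Policy`, and `run` are hypothetical names, the RM is reduced to a list of outstanding asks plus a slot counter, and each task is assumed to return its container when it finishes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical toy model of the scenario in this issue; not Tez or YARN code.
class PreemptionSketch {

    enum Policy { HOLD, RELEASE, RELEASE_AND_REREQUEST }

    // Toy stand-in for the RM: outstanding asks plus free capacity (headroom).
    static class ToyRM {
        int freeSlots;
        final List<Integer> asks = new ArrayList<>(); // priorities; lower is more urgent

        ToyRM(int freeSlots) { this.freeSlots = freeSlots; }

        void request(int priority) { asks.add(priority); }

        void release() { freeSlots++; }

        // Grants the most urgent outstanding ask, or returns -1 when there is
        // no headroom or nothing is outstanding. A granted ask is forgotten.
        int allocate() {
            if (freeSlots == 0 || asks.isEmpty()) return -1;
            Integer best = Collections.min(asks);
            asks.remove(best); // remove(Object): drops the satisfied ask
            freeSlots--;
            return best;
        }
    }

    // Returns the priorities of the tasks that eventually ran.
    static List<Integer> run(Policy policy) {
        ToyRM rm = new ToyRM(1);   // one slot: all headroom is used once it is granted
        rm.request(4);
        int held = rm.allocate();  // pri4 container arrives; RM considers the pri4 ask satisfied
        rm.request(1);             // pri1 request arrives before the pri4 container is matched

        // pri1 is scheduled first, does not match the held pri4 container,
        // and the RM has no headroom to grant a pri1 container.
        if (rm.allocate() == -1 && policy != Policy.HOLD) {
            rm.release();                      // give the unmatched new container back
            if (policy == Policy.RELEASE_AND_REREQUEST) {
                rm.request(held);              // ask again for the resource we gave up
            }
        }

        List<Integer> ran = new ArrayList<>();
        int granted;
        while ((granted = rm.allocate()) != -1) {
            ran.add(granted);  // task runs in the granted container
            rm.release();      // and returns the container when done
        }
        return ran;
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + run(p));
        }
    }
}
```

In this model, `HOLD` runs nothing (the hang), `RELEASE` runs only the pri1 task because the RM has forgotten the pri4 ask, and `RELEASE_AND_REREQUEST` runs pri1 followed by pri4.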

      Attachments

        1. TEZ-915.1.patch
          12 kB
          Bikas Saha
        2. TEZ-915.2.patch
          12 kB
          Bikas Saha

          People

            Assignee: Bikas Saha (bikassaha)
            Reporter: Bikas Saha (bikassaha)
            Votes: 0
            Watchers: 2
