Apache Tez / TEZ-3491

Tez job can hang due to container priority inversion

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7.1
    • Fix Version/s: 0.9.0, 0.8.5
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

       If the Tez AM receives containers at a lower priority than the highest-priority task being requested, it fails to assign those containers to any task. In addition, if a container is new, the AM refuses to release it as long as any tasks are pending. If the higher-priority requests take too long to be fulfilled (e.g.: the lower-priority containers are filling the queue), YARN eventually expires the unused lower-priority containers since they were never launched. The Tez AM never re-requests these lower-priority containers, and the job hangs because the AM is waiting for containers that the RM already granted and expired.
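
       A simplified sketch of the failing code path (hypothetical class, field, and helper names, not the actual YarnTaskSchedulerService code) illustrates the problem: a newly allocated container at a lower priority than the top pending request is neither assigned nor released, so it idles until the RM expires it.

       import java.util.HashMap;
       import java.util.List;
       import java.util.Map;
       import org.apache.hadoop.yarn.api.records.Container;
       import org.apache.hadoop.yarn.api.records.ContainerId;
       import org.apache.hadoop.yarn.api.records.Priority;

       // Hypothetical sketch of the failure mode, not the real scheduler code.
       abstract class PriorityInversionSketch {
         private final Map<ContainerId, Container> heldContainers = new HashMap<>();

         void onContainersAllocated(List<Container> containers) {
           for (Container c : containers) {
             // A lower numeric value means a higher priority in YARN.
             int topPending = highestPendingRequestPriority().getPriority();
             if (c.getPriority().getPriority() > topPending && pendingTaskCount() > 0) {
               // Lower-priority container: not assigned to the top-priority requests,
               // and not released either because tasks are still pending. It sits here
               // until the RM expires it, and since it is never re-requested the job
               // can hang waiting for an allocation the RM already granted and expired.
               heldContainers.put(c.getId(), c);
               continue;
             }
             assignToPendingTask(c);
           }
         }

         // Hypothetical helpers standing in for the scheduler's bookkeeping.
         abstract Priority highestPendingRequestPriority();
         abstract int pendingTaskCount();
         abstract void assignToPendingTask(Container container);
       }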

          Activity

           Jason Lowe added a comment -

          When the containers expire the Tez AM emits logs like this:

          2016-10-23 01:36:49,207 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Ignoring unknown container: container_e08_1475789370361_492567_01_000166
          2016-10-23 01:36:49,292 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: Skipping delayed container as container is no longer running, containerId=container_e08_1475789370361_492567_01_000166
          

          I can see a couple of approaches to fix this:
          1) Release the lower priority container to make sure we free up enough space to allocate the necessary high-priority containers to satisfy the top priority requests. These released containers need to be re-requested if there are still pending requests at the container's priority.

          2) Allow the lower priority container to be used by a lower priority task. We risk a similar priority inversion problem here if the lower priority task ends up waiting for the higher priority task to complete and needs to free up its resources for that to happen (e.g.: reducer waiting for upstream task but queue is full). However the existing preemption logic should cover this scenario since it can happen anyway (via fetch-failed task re-runs).

          I'm slightly leaning towards option 2) since there are many cases where the lower priority task can complete on its own (i.e.: has no dependencies on the pending higher-priority tasks), and we have an allocation in hand to start working on that task.

           Note another related problem that should be addressed: losing containers to expiration. Currently, if any container allocation expires, the Tez AM drops it without re-requesting it. This will either lead to reduced performance, if container reuse allows the AM to funnel the tasks through fewer containers, or to an outright hang if it cannot reuse other containers.

           Jason Lowe added a comment -

          Here's a patch that just addresses the hang issue. The AM is getting out of sync with the RM because it ignores the completion events for those containers when they expire. This patch causes the AM to reschedule containers when they complete if it can find outstanding requests at the same priority as the container.
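
           As a rough illustration only (hypothetical bookkeeping and helper names, not the attached patch), the rescheduling idea amounts to re-requesting an unused container when its completion event arrives, using the YARN AMRMClient API:

           import java.util.List;
           import org.apache.hadoop.yarn.api.records.Container;
           import org.apache.hadoop.yarn.api.records.ContainerId;
           import org.apache.hadoop.yarn.api.records.ContainerStatus;
           import org.apache.hadoop.yarn.api.records.Priority;
           import org.apache.hadoop.yarn.client.api.AMRMClient;
           import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

           // Hypothetical sketch, not the attached patch.
           abstract class RescheduleOnCompletionSketch {
             private final AMRMClientAsync<AMRMClient.ContainerRequest> amRmClient;

             RescheduleOnCompletionSketch(AMRMClientAsync<AMRMClient.ContainerRequest> amRmClient) {
               this.amRmClient = amRmClient;
             }

             void onContainersCompleted(List<ContainerStatus> statuses) {
               for (ContainerStatus status : statuses) {
                 // removeHeldContainer() stands in for whatever map tracks containers that
                 // were allocated but never launched.
                 Container held = removeHeldContainer(status.getContainerId());
                 if (held == null) {
                   continue; // the container ran a task; normal completion handling applies
                 }
                 Priority priority = held.getPriority();
                 if (pendingRequestCountAtPriority(priority) > 0) {
                   // The allocation was lost without ever running a task (e.g. the RM
                   // expired it), so ask the RM for a replacement to stay in sync.
                   amRmClient.addContainerRequest(new AMRMClient.ContainerRequest(
                       held.getResource(), null /* nodes */, null /* racks */, priority));
                 }
               }
             }

             // Hypothetical helpers.
             abstract Container removeHeldContainer(ContainerId containerId);
             abstract int pendingRequestCountAtPriority(Priority priority);
           }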

          Note that there's still the significant scheduling performance issue that is not fixed by the patch, namely the AM refusing to schedule new containers that are lower priority than the top priority pending requests and also not discarding them until they expire. We may integrate this fix and postpone addressing that to another JIRA so we can at least keep the Tez jobs from hanging completely in this scenario.

           TezQA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12835207/TEZ-3491.001.patch
          against master revision f735f48.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in :
          org.apache.tez.dag.app.dag.impl.TestCommit

          Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2061//testReport/
          Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2061//console

          This message is automatically generated.

           Jason Lowe added a comment -

          The test failure appears to be unrelated. Looks like a known issue being tracked by TEZ-3097, and the test passes for me locally with the patch applied.

           Siddharth Seth added a comment -

           Jason Lowe - the patch looks good to me. +1. Like you pointed out, it's not ideal to wait for timeouts and then make container requests again.
           What scenario led to this situation? Did YARN end up handing out a lower priority ask before a higher priority ask, did the AM decide to release containers that it had obtained at a higher priority, or were there multiple DAGs in the same AM?

           You had mentioned an alternate approach / potential improvement to solve the same problem in an offline discussion. I have a little more context after looking at the code again. It would be useful if you could add some notes about that on the jira.

           For sessions, where containers can be reused across multiple DAGs, I think the logic of avoiding a lower priority container for a higher priority task because of the risk of 'polluting' the container no longer holds. The held containers could be at any priority level. Maybe we should have a mode where the container priority check is not enforced. This would work for most submissions, but it would be problematic if the same file, the same jar, or conflicting jars are specified as LocalResources for different Vertices / DAGs.

           Jason Lowe added a comment -

          Filed TEZ-3535 to track improving what the scheduler does when it receives containers that are lower priority than the top priority task requests.

           What scenario led to this situation? Did YARN end up handing out a lower priority ask before a higher priority ask, did the AM decide to release containers that it had obtained at a higher priority, or were there multiple DAGs in the same AM?

          It was a resource request race. This occurred in a DAG that had a lot of root vertices. Vertex 0 had a relatively slow input initializer, so all the other root vertices initialized a few hundred milliseconds before vertex 0. That allowed the YarnTaskScheduler to receive all the task requests for the lower priority root vertices and request them from the RM. By the time those hundreds of containers arrived, vertex 0 had finished initializing and made a request for thousands of top-priority tasks. So YarnTaskScheduler refused to assign the lower priority containers to the top-priority tasks and held onto them. It took more than 10 minutes for the thousands of top-priority tasks from vertex 0 to be scheduled (the hundreds of low-priority, unused container allocations weren't helping here), and that caused the RM to expire the lower priority allocations. When vertex 0 finally finished the containers weren't reusable for the other vertices (due to locality reuse constraints), so the job hung. The expired, lower priority containers were never re-requested, so the Tez AM thought it was still going to receive allocations that the RM had already granted and expired.

          Note that this could occur in other DAGs between parallel paths in the graph. Just as one path starts requesting lower priority tasks the other path also requests higher priority tasks. If the scheduler heartbeats in-between those request batches, the RM can grant containers that are lower priority than the highest priority task request. The scheduler holds onto those, and we get poor performance.

          You had mentioned an alternate approach / potential improvement to solve the same problem in an offline discussion.

           I was thinking we should treat newly allocated containers like the non-reuse case when they arrive, i.e.: we look up task requests at the container's priority and assign them even if they aren't the highest priority. YARN shouldn't be giving us those containers unless it has already allocated containers for all other higher-priority requests or there's a request race (like above). I argue that we can just assign them to their natural priority requests directly even if there's a request race since we have to solve that problem anyway. We already can get into situations where lower priority tasks are requested, allocated, and assigned before higher priority requests arrive, so this would be no different. Certainly holding onto the lower priority containers for an indefinite time period is not the right thing to do. If we can't assign them then we should release them, but assigning them would be preferable. If the new containers can't be assigned because there are no outstanding requests at that priority then we can fall back to the normal reuse logic to try to use them. I have a prototype patch that I can post to TEZ-3535.
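
           As a rough sketch of that idea (hypothetical helper names; the prototype patch itself will go to TEZ-3535), the allocation path would look roughly like this:

           import java.util.List;
           import org.apache.hadoop.yarn.api.records.Container;
           import org.apache.hadoop.yarn.api.records.Priority;
           import org.apache.hadoop.yarn.client.api.AMRMClient;
           import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

           // Hypothetical sketch of the alternate approach, not the TEZ-3535 prototype itself.
           abstract class AssignAtContainerPrioritySketch {
             private final AMRMClientAsync<AMRMClient.ContainerRequest> amRmClient;

             AssignAtContainerPrioritySketch(AMRMClientAsync<AMRMClient.ContainerRequest> amRmClient) {
               this.amRmClient = amRmClient;
             }

             void onContainersAllocated(List<Container> containers) {
               for (Container container : containers) {
                 // First try a pending task request at the container's own priority,
                 // even if higher-priority requests are still outstanding.
                 if (assignToPendingRequestAtPriority(container, container.getPriority())) {
                   continue;
                 }
                 // Otherwise fall back to the normal reuse/matching logic.
                 if (assignViaReuseLogic(container)) {
                   continue;
                 }
                 // Never hold a new container indefinitely: release it rather than let it expire.
                 amRmClient.releaseAssignedContainer(container.getId());
               }
             }

             // Hypothetical helpers representing the scheduler's existing matching logic.
             abstract boolean assignToPendingRequestAtPriority(Container container, Priority priority);
             abstract boolean assignViaReuseLogic(Container container);
           }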

          Thanks for the review! I'll commit this later today.

           Jason Lowe added a comment -

          I committed this to master and branch-0.8.

           Siddharth Seth added a comment -

           Got it. I'm wondering if it would help to go back to the behaviour where vertices at the same level generate the same priority, if they have similar container requests. I think that minimizes the chances of hitting this. Like you pointed out, it is still possible if a vertex at a lower priority ends up starting before a vertex at a higher priority on a parallel branch.
           With the changed priority model, there's the additional change that the scheduler will try to finish tasks for a certain vertex instead of randomly allocating tasks across vertices at the same level. That was a change we wanted to achieve, though via different means.

           I was thinking we should treat newly allocated containers like the non-reuse case when they arrive, i.e.: we look up task requests at the container's priority and assign them even if they aren't the highest priority.

          This would result in preemption if containers are not available for tasks at a higher priority level, correct? Makes sense to me - it's better than holding on to them and doing nothing.

           I'd prefer ignoring priorities while assigning containers. Track situations where this happens, or check the pending request table to see if requests are outstanding with YARN, and make additional requests. Do you think that will work?

           Jason Lowe added a comment -

           I'd prefer ignoring priorities while assigning containers. Track situations where this happens, or check the pending request table to see if requests are outstanding with YARN, and make additional requests. Do you think that will work?

          I think it could. Probably a more involved change but I think it would execute more efficiently in practice. If we go that route then I would argue Tez should avoid using priorities in YARN as much as possible. Allocate all containers at the same priority and assign them based on task priority. This doesn't work in practice due to the different-container-sizes-at-the-same-priority limitation in YARN that was only recently fixed, so it would still need to change the priority when the containers are different sizes.
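
           To illustrate (hypothetical names, and assuming a single container size, which is exactly the YARN limitation noted above), the AM would request everything at one fixed YARN priority and keep the real ordering in its own task-priority queue:

           import java.util.List;
           import java.util.PriorityQueue;
           import org.apache.hadoop.yarn.api.records.Container;
           import org.apache.hadoop.yarn.api.records.Priority;
           import org.apache.hadoop.yarn.api.records.Resource;
           import org.apache.hadoop.yarn.client.api.AMRMClient;
           import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

           // Hypothetical sketch: one fixed YARN priority, task ordering kept AM-side.
           class SinglePrioritySketch {
             private static final Priority FIXED_PRIORITY = Priority.newInstance(1);

             private final AMRMClientAsync<AMRMClient.ContainerRequest> amRmClient;
             // Pending tasks ordered by Tez task priority; TaskRequest is a hypothetical type.
             private final PriorityQueue<TaskRequest> pendingTasks = new PriorityQueue<>();

             SinglePrioritySketch(AMRMClientAsync<AMRMClient.ContainerRequest> amRmClient) {
               this.amRmClient = amRmClient;
             }

             void requestContainer(TaskRequest task, Resource size) {
               pendingTasks.add(task);
               // Every ask goes out at the same YARN priority; assumes all tasks use the same size.
               amRmClient.addContainerRequest(new AMRMClient.ContainerRequest(
                   size, null /* nodes */, null /* racks */, FIXED_PRIORITY));
             }

             void onContainersAllocated(List<Container> containers) {
               for (Container c : containers) {
                 TaskRequest task = pendingTasks.poll(); // most urgent task priority first
                 if (task != null) {
                   task.assign(c);
                 } else {
                   amRmClient.releaseAssignedContainer(c.getId());
                 }
               }
             }

             interface TaskRequest extends Comparable<TaskRequest> {
               void assign(Container container);
             }
           }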

          We can continue the discussion on TEZ-3535.


            People

            • Assignee: Jason Lowe
            • Reporter: Jason Lowe
            • Votes: 0
            • Watchers: 5
