Filed TEZ-3535 to track improving what the scheduler does when it receives containers that are lower priority than the top priority task requests.
What scenario led to this situation? Did YARN end up handing out a lower priority ask before a higher priority ask, did the AM decide to release containers that it had obtained at a higher priority, or were multiple DAGs running in the same AM?
It was a resource request race. This occurred in a DAG that had a lot of root vertices. Vertex 0 had a relatively slow input initializer, so all the other root vertices initialized a few hundred milliseconds before vertex 0. That allowed the YarnTaskScheduler to receive all the task requests for the lower-priority root vertices and request them from the RM. By the time those hundreds of containers arrived, vertex 0 had finished initializing and had requested thousands of top-priority tasks. The YarnTaskScheduler therefore refused to assign the lower-priority containers to the top-priority tasks and held onto them. It took more than 10 minutes for the thousands of top-priority tasks from vertex 0 to be scheduled (the hundreds of unused, lower-priority container allocations weren't helping here), and that caused the RM to expire the lower-priority allocations. When vertex 0 finally finished, the held containers weren't reusable for the other vertices (due to locality reuse constraints), so the job hung. The expired, lower-priority containers were never re-requested, so the Tez AM thought it was still going to receive allocations that the RM had already granted and expired.
Note that this could occur in other DAGs between parallel paths in the graph: just as one path starts requesting lower-priority tasks, the other path requests higher-priority tasks. If the scheduler heartbeats in between those request batches, the RM can grant containers at a lower priority than the highest-priority task request. The scheduler holds onto those containers, and we get poor performance.
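The holding behavior described above can be sketched roughly as follows. This is an illustrative toy, not the real YarnTaskScheduler code; the class and method names are invented. It shows a scheduler that only matches a container whose priority equals the current top-priority request (lowest number = highest priority, as in YARN), so containers granted for a lower-priority ask that raced in first are simply held:

```java
import java.util.*;

// Toy model of the problematic behavior; names are hypothetical,
// not the actual YarnTaskScheduler API.
public class PriorityRaceSketch {
    // priority -> count of pending task requests (lower number = higher priority)
    static final TreeMap<Integer, Integer> pendingByPriority = new TreeMap<>();
    // containers granted by the RM but neither assigned nor released
    static final List<Integer> heldContainers = new ArrayList<>();

    static String onContainerAllocated(int containerPriority) {
        Integer topPriority = pendingByPriority.isEmpty() ? null : pendingByPriority.firstKey();
        if (topPriority != null && containerPriority == topPriority) {
            decrement(topPriority);
            return "ASSIGNED";
        }
        // Lower priority than the top request: held indefinitely,
        // until the RM eventually expires the allocation.
        heldContainers.add(containerPriority);
        return "HELD";
    }

    static void decrement(int p) {
        int n = pendingByPriority.get(p) - 1;
        if (n == 0) pendingByPriority.remove(p); else pendingByPriority.put(p, n);
    }

    public static void main(String[] args) {
        // Lower-priority vertices (priority 5) asked first and the RM granted
        // their containers, but by arrival time vertex 0 has registered
        // its top-priority (priority 1) requests.
        pendingByPriority.put(1, 3);
        pendingByPriority.put(5, 2);
        System.out.println(onContainerAllocated(5)); // HELD
        System.out.println(onContainerAllocated(5)); // HELD
        System.out.println(onContainerAllocated(1)); // ASSIGNED
        // Two containers now sit unused until the RM expires them.
        System.out.println(heldContainers.size());   // 2
    }
}
```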
In an offline discussion you had mentioned an alternate approach / potential improvement to solve the same problem.
I was thinking we should treat newly allocated containers like the non-reuse case when they arrive, i.e. look up task requests at the container's priority and assign the containers even if they aren't the highest priority. YARN shouldn't be giving us those containers unless it has already allocated containers for all other higher-priority requests or there's a request race (like the one above). I'd argue we can assign them directly to requests at their natural priority even in a request race, since we have to solve that problem anyway: we can already get into situations where lower-priority tasks are requested, allocated, and assigned before higher-priority requests arrive, so this would be no different. Holding onto the lower-priority containers for an indefinite period is certainly not the right thing to do. If we can't assign them then we should release them, but assigning them is preferable. If a new container can't be assigned because there are no outstanding requests at its priority, we can fall back to the normal reuse logic to try to use it. I have a prototype patch that I can post to TEZ-3535.
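The proposed handling could be sketched like this. Again, this is a hedged toy, not the prototype patch: `onNewContainer`, `TaskRequest`, and `tryReuseMatching` are invented names standing in for the scheduler's real structures. The key change is the order of operations: match the container's own priority first, then fall back to reuse matching, and release rather than hold if neither works:

```java
import java.util.*;

// Hypothetical sketch of the proposed new-container handling;
// names are illustrative, not the actual YarnTaskScheduler API.
public class NewContainerAssignment {
    static class Container {
        final int priority;
        Container(int priority) { this.priority = priority; }
    }
    static class TaskRequest {
        final int priority;
        TaskRequest(int priority) { this.priority = priority; }
    }

    // Pending task requests grouped by priority.
    static final Map<Integer, Deque<TaskRequest>> pendingRequests = new HashMap<>();
    static final List<Container> released = new ArrayList<>();

    // Treat a newly allocated container like the non-reuse case:
    // match it against requests at its own priority, even when
    // higher-priority requests exist.
    static String onNewContainer(Container c) {
        Deque<TaskRequest> atPriority = pendingRequests.get(c.priority);
        if (atPriority != null && !atPriority.isEmpty()) {
            atPriority.poll();
            return "ASSIGNED";
        }
        if (tryReuseMatching(c)) {   // fall back to the normal reuse logic
            return "REUSED";
        }
        released.add(c);             // never hold indefinitely: release instead
        return "RELEASED";
    }

    // Placeholder for the scheduler's existing reuse matching.
    static boolean tryReuseMatching(Container c) {
        return false;
    }

    public static void main(String[] args) {
        // Higher-priority (priority 1) requests exist, but the RM already
        // granted a priority-5 container for the earlier, lower-priority ask.
        pendingRequests.computeIfAbsent(1, k -> new ArrayDeque<>()).add(new TaskRequest(1));
        pendingRequests.computeIfAbsent(5, k -> new ArrayDeque<>()).add(new TaskRequest(5));

        // The priority-5 container is assigned to the priority-5 request
        // instead of being held while priority-1 requests wait.
        System.out.println(onNewContainer(new Container(5))); // ASSIGNED
        // A container with no matching request and no reuse match is released.
        System.out.println(onNewContainer(new Container(7))); // RELEASED
    }
}
```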
Thanks for the review! I'll commit this later today.