Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
0.23.3, 2.0.1-alpha
None
Description
When a user runs a job where one of the input files is a large file on another cluster, the job can create many splits whose preferred nodes are not part of the current cluster and so can never be used for computation. Because none of those locality requests can be satisfied, every container must be assigned off-switch, and the off-switch delay logic in LeafQueue can then cause the ResourceManager to allocate containers for the job very slowly. In one case the job was getting only one container every 23 seconds, even though the queue had plenty of spare capacity.
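
To illustrate why a large number of unreachable splits can slow allocation, here is a minimal sketch of a delay rule of this kind. It is not the actual LeafQueue code: the class, method names, and the `localityWaitFactor` parameter are hypothetical, and the rule modelled here (an off-switch assignment is allowed only after the number of missed scheduling opportunities exceeds the number of pending containers scaled by a wait factor) is an assumption made for illustration.

```java
public class OffSwitchDelaySketch {

    /**
     * Hypothetical, simplified off-switch check (an assumption, not the real
     * LeafQueue logic): allow an off-switch container only once the job has
     * missed more scheduling opportunities than it has pending containers,
     * scaled by localityWaitFactor.
     */
    static boolean canAssignOffSwitch(int requiredContainers,
                                      double localityWaitFactor,
                                      long missedOpportunities) {
        return requiredContainers * localityWaitFactor < missedOpportunities;
    }

    /**
     * Counts how many node heartbeats (scheduling opportunities) pass before
     * containersWanted off-switch containers have been allocated, assuming
     * no request can ever be satisfied node-locally or rack-locally.
     */
    static long heartbeatsToAllocate(int pendingContainers,
                                     double localityWaitFactor,
                                     int containersWanted) {
        long heartbeats = 0;
        long missed = 0;
        int allocated = 0;
        int pending = pendingContainers;
        while (allocated < containersWanted && pending > 0) {
            heartbeats++;
            if (canAssignOffSwitch(pending, localityWaitFactor, missed)) {
                allocated++;
                pending--;
                missed = 0; // the delay starts over after each assignment
            } else {
                missed++;   // this heartbeat was a missed opportunity
            }
        }
        return heartbeats;
    }

    public static void main(String[] args) {
        // Compare a job with a few pending containers against one with many.
        System.out.println("pending=10:   "
                + heartbeatsToAllocate(10, 1.0, 1) + " heartbeats for first container");
        System.out.println("pending=1000: "
                + heartbeatsToAllocate(1000, 1.0, 1) + " heartbeats for first container");
    }
}
```

Under this assumed rule the wait before each off-switch container grows with the number of pending requests, so a job with many splits on unreachable nodes waits longest per container regardless of how much capacity the queue has free.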