Description
This is reliably reproducible on a single Mac OS X machine running CDH4.1.2 or CDH4.2.0.
- Two queues are configured, cjmQ and slotsQ, both with tiny minResources. The intention is for the task(s) of the job in cjmQ to be able to preempt tasks of the job in slotsQ.
- yarn.nodemanager.resource.memory-mb = 24576
- First, a long-running 6-map-task (0 reducers) mapreduce job is started in slotsQ with mapreduce.map.memory.mb=4096. Since 6 x 4096 MB would exactly consume the node's 24576 MB, and the MRAppMaster's container also needs memory, only 5 of the 6 map tasks are able to start; the 6th stays pending and will never run.
- Then, a short-running 1-map-task (0 reducers) mapreduce job is submitted via cjmQ with mapreduce.map.memory.mb=2048 (a driver sketch for this submission follows this list).
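For concreteness, below is a minimal driver sketch for the second submission. The class names are hypothetical stand-ins for my actual jobs; only the queue and memory settings are the real ones from the scenario above. The long-running slotsQ job is submitted analogously, with mapreduce.job.queuename=slotsQ, mapreduce.map.memory.mb=4096, and six input splits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

/** Hypothetical driver for the short cjmQ job described above. */
public class CjmQShortJob {

  /** Map task that sleeps briefly so the job stays short-running. */
  public static class ShortSleepMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws InterruptedException {
      Thread.sleep(1000L); // stand-in for a small amount of real work
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.job.queuename", "cjmQ");  // submit into the starved queue
    conf.setInt("mapreduce.map.memory.mb", 2048); // one 2048 MB map container
    Job job = Job.getInstance(conf, "cjmQ-short-job");
    job.setJarByClass(CjmQShortJob.class);
    job.setMapperClass(ShortSleepMapper.class);
    job.setNumReduceTasks(0);                     // 0 reducers, as in the scenario
    job.setOutputFormatClass(NullOutputFormat.class);
    // One small input file yields a single split, i.e. a single map task.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}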
Expected behavior:
At this point, because the minimum share of cjmQ had not been met, I expected the Fair Scheduler to preempt one of the executing map tasks of the slotsQ job to make room for the single map task of the cjmQ job. However, the Fair Scheduler did not preempt any of the running slotsQ map tasks; instead, the cjmQ job was starved perpetually. Since slotsQ had far more than its minimum share allocated and running, while cjmQ was far below its minimum share (at 0, in fact), the Fair Scheduler should have started preempting, regardless of the slotsQ job having one container request (for the 6th map task) that was never allocated.
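To make the expectation concrete, here is a simplified sketch of the min-share preemption test I expected to fire. This is my paraphrase of the documented behavior, not the actual FairScheduler source, and the method name and MB-only accounting are mine:

/**
 * Simplified sketch of the min-share preemption test I expected to fire.
 * Paraphrases the documented behavior; not the FairScheduler source.
 */
public final class MinSharePreemptionCheck {

  /**
   * Memory (MB) that should be reclaimed for a starved queue: once the
   * queue has been below its min share for longer than its
   * minSharePreemptionTimeout, preempt up to the shortfall, regardless of
   * whether other queues still have pending, unallocated requests.
   */
  static int memoryToPreemptMb(int usedMb, int minShareMb,
                               long belowMinShareSinceMs, long nowMs,
                               long timeoutMs) {
    boolean starvedTooLong =
        usedMb < minShareMb && (nowMs - belowMinShareSinceMs) > timeoutMs;
    return starvedTooLong ? minShareMb - usedMb : 0;
  }

  public static void main(String[] args) {
    // The scenario above: cjmQ uses 0 MB against minResources=2048 with a
    // 5-second timeout, and has been starved for a minute. The check says
    // 2048 MB should be preempted from slotsQ, yet nothing happens.
    System.out.println(memoryToPreemptMb(0, 2048, 0L, 60000L, 5000L)); // 2048
  }
}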
Additional useful info:
- If I submit a second 1-map-task mapreduce job via cjmQ, the first cjmQ job gets scheduled and its state changes to RUNNING; once that first job completes, the second job submitted via cjmQ is starved until a third job is submitted into cjmQ, and so on. This happens regardless of the values of maxRunningApps in the queue configurations.
- If, instead of requesting 6 map tasks for the slotsQ job, I request only 5, so that everything fits into yarn.nodemanager.resource.memory-mb without a 6th pending-but-never-running task, then preemption works as I expected. However, I cannot rely on this arrangement: in a production cluster running at full capacity, if a machine dies, the slotsQ mapreduce job will request new containers for the failed tasks, and because the cluster was already at capacity, those containers will stay pending forever, recreating the original scenario of the starving cjmQ job.
- I initially wrote this up on https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/0zv62pkN5lM, so it would be good to update that group with the resolution.
Configuration:
In yarn-site.xml:
<property>
  <description>Scheduler plug-in class to use instead of the default scheduler.</description>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
In fair-scheduler.xml:
<configuration>
  <!-- Site specific FairScheduler configuration properties -->
  <property>
    <description>Absolute path to allocation file. An allocation file is an XML manifest describing queues and their properties, in addition to certain policy defaults. This file must be in XML format as described in http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.</description>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>[obfuscated]/current/conf/site/default/hadoop/fair-scheduler-allocations.xml</value>
  </property>
  <property>
    <description>Whether to use preemption. Note that preemption is experimental in the current version. Defaults to false.</description>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
  </property>
  <property>
    <description>Whether to allow multiple container assignments in one heartbeat. Defaults to false.</description>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>true</value>
  </property>
</configuration>
My fair-scheduler-allocations.xml:
<allocations>
  <queue name="cjmQ">
    <!-- minimum amount of aggregate memory; TODO which units??? -->
    <minResources>2048</minResources>
    <!-- limit the number of apps from the queue to run at once -->
    <maxRunningApps>1</maxRunningApps>
    <!-- either "fifo" or "fair" depending on the in-queue scheduling policy desired -->
    <schedulingMode>fifo</schedulingMode>
    <!-- Number of seconds after which the pool can preempt other pools' tasks to
         achieve its min share. Requires preemption to be enabled via
         yarn.scheduler.fair.preemption. Defaults to infinity (no preemption). -->
    <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
    <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
    <weight>1.0</weight>
  </queue>
  <queue name="slotsQ">
    <!-- minimum amount of aggregate memory; TODO which units??? -->
    <minResources>1</minResources>
    <!-- limit the number of apps from the queue to run at once -->
    <maxRunningApps>1</maxRunningApps>
    <!-- Number of seconds after which the pool can preempt other pools' tasks to
         achieve its min share. Requires preemption to be enabled via
         yarn.scheduler.fair.preemption. Defaults to infinity (no preemption). -->
    <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
    <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
    <weight>1.0</weight>
  </queue>
  <!-- Number of seconds a queue is under its fair share before it will try to
       preempt containers to take resources from other queues. -->
  <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>
</allocations>