When short jobs are executed in Hadoop with OutOfBandHeartBeat=true, the JT executes the heartBeat() method very frequently. Each heartbeat internally makes a call to CapacityTaskScheduler.updateQSIObjects().
CapacityTaskScheduler.updateQSIObjects() internally calls String.format() to set the job scheduling information. Depending on the size of the "jobQueuesManager" and "queueInfoMap" data structures, the number of String.format() calls becomes very high. String.format() internally does pattern matching, which turns out to be very heavy. (This was revealed while profiling the JT: almost 57% of the time was spent in CapacityScheduler.assignTasks(), out of which String.format() took 46%.)
Would it be possible to call String.format() only when JobInProgress.getSchedulingInfo() is invoked? This might reduce the pressure on the JT while processing heartbeats.
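A minimal sketch of the deferred-formatting idea, assuming the scheduler can store raw counters and build the string only when it is read. The class LazySchedulingInfo, its fields, and the message template are hypothetical and are not taken from the Hadoop code base; the sketch only illustrates moving String.format() off the heartbeat path.

```java
/**
 * Hypothetical holder for per-job scheduling info (not a Hadoop class).
 * update() is the cheap call made on every heartbeat; the expensive
 * String.format() runs only when the text is actually requested.
 */
final class LazySchedulingInfo {
    // Assumed template, for illustration only.
    private static final String TEMPLATE =
        "%d running map tasks, %d running reduce tasks, %d waiting";

    private int runningMaps;
    private int runningReduces;
    private int waiting;
    private String cached;   // null means the cached text is stale
    private int formatCalls; // counts how often String.format() really ran

    /** Hot path: store the raw numbers and invalidate the cached text. */
    synchronized void update(int runningMaps, int runningReduces, int waiting) {
        this.runningMaps = runningMaps;
        this.runningReduces = runningReduces;
        this.waiting = waiting;
        this.cached = null;
    }

    /** Cold path: format on demand and reuse the result until the next update. */
    @Override
    public synchronized String toString() {
        if (cached == null) {
            cached = String.format(TEMPLATE, runningMaps, runningReduces, waiting);
            formatCalls++;
        }
        return cached;
    }

    synchronized int getFormatCalls() {
        return formatCalls;
    }
}

class LazySchedulingInfoDemo {
    public static void main(String[] args) {
        LazySchedulingInfo info = new LazySchedulingInfo();
        // Simulate many heartbeats: no formatting happens here.
        for (int i = 0; i < 100_000; i++) {
            info.update(i % 7, i % 3, i % 11);
        }
        // Formatting happens once, when a client asks for the info.
        System.out.println(info);
        System.out.println("format calls: " + info.getFormatCalls());
    }
}
```

With this shape, the cost of formatting scales with how often the scheduling information is read rather than with the heartbeat rate. Whether a getter such as JobInProgress.getSchedulingInfo() could simply call toString() on an object like this is an assumption that would need checking against the actual scheduler code.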
|Field||Original Value||New Value|
|Assignee||Amar Kamat [ amar_kamat ]|
|Release Note||Incremental enhancements to the JobTracker to optimize heartbeat handling.|
|Summary||reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects||Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and Counters.makeEscapedString()|
|Component/s||jobtracker [ 12312907 ]|
|Assignee||Amar Kamat [ amar_kamat ]||Dick King [ dking ]|
|Status||Patch Available [ 10002 ]||Open [ 1 ]|
|Status||Patch Available [ 10002 ]||Resolved [ 5 ]|
|Resolution||Fixed [ 1 ]|
|Fix Version/s||0.22.0 [ 12314184 ]|
|Status||Resolved [ 5 ]||Closed [ 6 ]|
|Transition||Time In Source Status||Execution Times||Last Executer||Last Execution Date|
|2d 19h 47m||1||Dick King||24/May/10 18:29|
|85d 17h 36m||2||Dick King||24/May/10 18:30|
|12d 11h 42m||1||Chris Douglas||06/Jun/10 06:13|
|554d 5m||1||Konstantin Shvachko||12/Dec/11 06:19|