JobInProgress.initTasks takes a significant amount of time on a large cluster for large jobs (55k maps * 3 splits), and the JobInProgress object stays locked for that entire time.
Meanwhile, the JobClient calls JobTracker.getTaskCompletionEvents, which locks the JobTracker and then tries to lock the JobInProgress. Because it holds the JobTracker lock while waiting for the JobInProgress lock, it starves all heartbeats that are trying to lock the JobTracker, resulting in a lockup.
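
For illustration, the contention pattern can be reproduced in isolation. This is only a sketch, not the JobTracker code: the class and method names (LockStarvationSketch, heartbeatGetsTrackerLock) are hypothetical, explicit ReentrantLock objects stand in for the JobTracker and JobInProgress monitors, and a sleep stands in for the long-running initTasks. The timed tryLock plays the role of a heartbeat so that the starvation can be observed without actually hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-alone model of the reported lock pattern (not Hadoop code).
class LockStarvationSketch {

    // Returns true if a simulated heartbeat obtains the "tracker" lock within
    // heartbeatWaitMs while a simulated initTasks holds the "job" lock for initMs.
    static boolean heartbeatGetsTrackerLock(long initMs, long heartbeatWaitMs)
            throws InterruptedException {
        ReentrantLock trackerLock = new ReentrantLock(); // stands in for the JobTracker monitor
        ReentrantLock jobLock = new ReentrantLock();     // stands in for the JobInProgress monitor
        CountDownLatch jobLocked = new CountDownLatch(1);
        CountDownLatch trackerLocked = new CountDownLatch(1);

        // Simulated initTasks: holds the job lock for a long time.
        Thread init = new Thread(() -> {
            jobLock.lock();
            try {
                jobLocked.countDown();
                Thread.sleep(initMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                jobLock.unlock();
            }
        });

        // Simulated getTaskCompletionEvents: takes the tracker lock first,
        // then blocks on the job lock while still holding the tracker lock.
        Thread client = new Thread(() -> {
            trackerLock.lock();
            try {
                trackerLocked.countDown();
                jobLock.lock();
                jobLock.unlock();
            } finally {
                trackerLock.unlock();
            }
        });

        init.start();
        jobLocked.await();      // job lock is now held by init
        client.start();
        trackerLocked.await();  // tracker lock is now held by client

        // Simulated heartbeat: needs only the tracker lock, but waits with a
        // timeout so the sketch terminates instead of hanging.
        boolean acquired = trackerLock.tryLock(heartbeatWaitMs, TimeUnit.MILLISECONDS);
        if (acquired) {
            trackerLock.unlock();
        }

        init.join();
        client.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("long init, heartbeat served: " + heartbeatGetsTrackerLock(500, 100));
        System.out.println("short init, heartbeat served: " + heartbeatGetsTrackerLock(50, 2000));
    }
}
```

In this model the heartbeat never touches the job lock, yet it is blocked for as long as the simulated initTasks runs, because the client thread sits on the tracker lock while waiting for the job lock. Not acquiring the JobInProgress lock while holding the JobTracker lock, or shortening the time initTasks holds its lock, would each break the chain in this sketch; which of these fits the real code is not something the sketch can settle.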