In an offline discussion, Amareshwari explained that this issue occurs if a setup/cleanup task is running on a TT that subsequently becomes lost, and the task moves to the KILLED_UNCLEAN state. This causes the setup/cleanup task to be incorrectly added to the list of tasks that need cleanup.
Confirmed the same from the logs:
2009-02-28 23:05:41,986 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200902261046_9662_m_007800_0' to tip task_200902261046_9662_m_007800, for tracker '<tracker_host:port>'
2009-02-28 23:17:14,800 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200902261046_9662_m_007800_0: Lost task tracker: <tracker_host:port>
The result is that the job's cleanup task got stuck: it is shown in the pending state on the JT UI, and no subsequent attempts are launched for it, so the job hangs. I tried killing the cleanup attempt from the client command line, thinking it might get rescheduled, but that fails with the message "Could not kill task attempt_200902261046_9662_m_007800_0". Even killing the job didn't work.
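
To make the failure mode described at the top easier to reason about, here is a minimal toy model in Java. It is only a sketch: `LostTrackerSketch`, `Attempt`, `onTrackerLost` and the `skipSetupCleanup` flag are hypothetical names invented for illustration and are not the actual JobTracker/TaskInProgress code. The guard shown (excluding setup/cleanup tasks from the needs-cleanup list) is an assumption about one possible fix, not necessarily what the real patch does.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and are NOT the real JobTracker classes.
class LostTrackerSketch {
    enum AttemptState { RUNNING, KILLED, KILLED_UNCLEAN }

    // A task attempt; isJobSetupOrCleanup marks the job-level setup/cleanup tasks.
    static final class Attempt {
        final String id;
        final boolean isJobSetupOrCleanup;
        AttemptState state = AttemptState.RUNNING;

        Attempt(String id, boolean isJobSetupOrCleanup) {
            this.id = id;
            this.isJobSetupOrCleanup = isJobSetupOrCleanup;
        }
    }

    // Handles a lost tracker: updates the state of each attempt that was running
    // on it and returns the attempts queued as "needing task cleanup".
    // With skipSetupCleanup == false this reproduces the reported behaviour:
    // every attempt, including a setup/cleanup one, becomes KILLED_UNCLEAN
    // and is queued. With skipSetupCleanup == true (assumed guard) the
    // setup/cleanup attempt is simply marked KILLED and is not queued.
    static List<Attempt> onTrackerLost(List<Attempt> runningOnTracker, boolean skipSetupCleanup) {
        List<Attempt> needCleanup = new ArrayList<>();
        for (Attempt a : runningOnTracker) {
            if (skipSetupCleanup && a.isJobSetupOrCleanup) {
                a.state = AttemptState.KILLED;
                continue;
            }
            a.state = AttemptState.KILLED_UNCLEAN;
            needCleanup.add(a);
        }
        return needCleanup;
    }

    public static void main(String[] args) {
        List<Attempt> running = new ArrayList<>();
        running.add(new Attempt("job_cleanup_attempt", true));
        running.add(new Attempt("map_attempt", false));

        // Unguarded path: the job cleanup attempt wrongly lands in the cleanup list.
        for (Attempt a : onTrackerLost(running, false)) {
            System.out.println("queued for cleanup: " + a.id + " (" + a.state + ")");
        }
    }
}
```

In this model, calling `onTrackerLost(running, false)` queues the job cleanup attempt alongside the ordinary map attempt, which mirrors the state described above; passing `true` keeps it out of the list.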