I was running Hadoop on 800 machines. After a couple of jobs had run, and after 100% of the maps of the current job had run, the JobTracker stopped responding and all TaskTrackers were lost. When I looked at the JobTracker (JT) logs, this entry seemed alarming:
2007-12-26 19:18:30,185 WARN org.apache.hadoop.ipc.Server: Exception in Responder java.util.ConcurrentModificationException
Following the above exception, I saw a large number of warnings like:
2007-12-26 19:23:10,926 WARN org.apache.hadoop.ipc.Server: Call queue overflow discarding oldest call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@5a05f9, false, true, 1758) from 126.96.36.199:1234
Judging by the number of call queue overflow warnings, it seemed that the JobTracker stopped processing RPCs after it hit the ConcurrentModificationException. Around the same time, the TaskTrackers started getting timeouts on their RPCs.
There were two occurrences of the ConcurrentModificationException in the log, but the first did not seem to have any effect on the call queue.
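For readers unfamiliar with the two symptoms, here is a minimal standalone sketch of the general mechanisms involved. It is not the code from org.apache.hadoop.ipc.Server, and it does not claim to be the root cause; the class and method names (ResponderSketch, triggersCme, offerAll) are made up for illustration. The first method shows how a fail-fast iterator raises ConcurrentModificationException when a collection is modified during iteration. The second shows a bounded queue that discards its oldest entry when nothing drains it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not Hadoop's ipc.Server implementation.
class ResponderSketch {

    // Returns true if modifying the list while iterating raised a CME.
    static boolean triggersCme() {
        List<String> pending = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String call : pending) {
                // Structural change while an iterator is live: the
                // fail-fast iterator detects it on its next step.
                pending.add(call + "-retry");
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    // Adds `calls` entries to a queue limited to `capacity` entries that
    // nobody drains, dropping the oldest entry whenever the queue is full.
    // Returns how many entries were discarded.
    static int offerAll(Deque<Integer> queue, int capacity, int calls) {
        int discarded = 0;
        for (int i = 0; i < calls; i++) {
            if (queue.size() >= capacity) {
                queue.pollFirst(); // discard the oldest call
                discarded++;
            }
            queue.addLast(i);
        }
        return discarded;
    }

    public static void main(String[] args) {
        System.out.println("CME raised: " + triggersCme());
        Deque<Integer> queue = new ArrayDeque<>();
        System.out.println("Discarded: " + offerAll(queue, 3, 5));
    }
}
```

If the thread that consumes the queue exits after an uncaught exception, every later call lands in the "queue full" branch, which would be consistent with the steady stream of overflow warnings above.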