Details
Bug
Status: Closed
Blocker
Resolution: Fixed
0.13.0
None
None
Description
All our jobs on a 600-node cluster fail. The symptom is that the local filecache disappears.
This may be because lost tasktrackers are re-initialized when they send a heartbeat again, and they then purge the local directory completely without updating the filecache.
A side issue: why do we get so many lost tasktrackers that later resume their heartbeat (a kind of 'bogus' lost tasktracker)? The number of tasktrackers we lost:
13 in the 1st hour of the job
18 in the 2nd hour
33 in the 3rd hour
Then the job failed.
For example, all the tasktrackers lost in the first 2 hours of the job were logged some time later with a 'Status from unknown Tracker' message in the jobtracker log and were reinitialized.
I attach some jobtracker log messages showing how the heartbeats of the lost tasktrackers come in late, sometimes less than 1 minute late, sometimes up to 16 minutes. What could be the reason? Do the heartbeats get lost?
2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070
2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070
2007-06-07 13:39:08,740 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_075
2007-06-07 13:41:50,810 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_075
2007-06-07 14:32:29,093 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_082
2007-06-07 14:35:34,217 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_082
2007-06-07 14:15:48,856 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_085
2007-06-07 14:20:21,337 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_085
2007-06-07 15:25:49,524 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_098
2007-06-07 15:33:56,732 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_098
2007-06-07 14:49:09,203 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_106
2007-06-07 14:54:25,538 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_106
2007-06-07 15:02:29,337 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_108
2007-06-07 15:02:57,558 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_108
2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112
2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112
2007-06-07 14:19:08,881 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_114
2007-06-07 14:30:03,354 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_114
2007-06-07 15:42:29,579 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_116
2007-06-07 15:43:06,422 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_116
2007-06-07 14:55:49,280 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_117
2007-06-07 14:56:38,452 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_117
2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120
2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120
2007-06-07 15:09:09,435 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_174
2007-06-07 15:18:31,254 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_174
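To put numbers on how late each tracker came back, the excerpt above can be run through a small script. The following is only a sketch: it assumes the exact two message shapes quoted above ("Lost tracker" and "Status_from_unknown_Tracker :"), and the names `LINE_RE`, `parse_time` and `resume_delays` are made up for illustration; they are not part of Hadoop.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the jobtracker excerpt above:
#   <date> <time>,<millis> <LEVEL> <logger>: <event> <tracker>
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) \w+ \S+ "
    r"(Lost tracker|Status_from_unknown_Tracker :) (\S+)$"
)


def parse_time(stamp, millis):
    """Combine the second-resolution timestamp with its millisecond suffix."""
    t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return t.replace(microsecond=int(millis) * 1000)


def resume_delays(lines):
    """Map each tracker to the seconds between its 'Lost tracker' message
    and the next 'Status_from_unknown_Tracker' message for the same tracker."""
    lost_at = {}
    delays = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not one of the two messages
        when = parse_time(m.group(1), m.group(2))
        event, tracker = m.group(3), m.group(4)
        if event == "Lost tracker":
            lost_at[tracker] = when
        elif tracker in lost_at:
            delays[tracker] = (when - lost_at.pop(tracker)).total_seconds()
    return delays


# Two pairs copied from the excerpt above.
sample = [
    "2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112",
    "2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112",
    "2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120",
    "2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120",
]
for tracker, seconds in sorted(resume_delays(sample).items()):
    print(tracker, round(seconds, 3))
```

On these two pairs the gap is about 6 seconds for tracker_112 and about 948 seconds (just under 16 minutes) for tracker_120, which matches the range described above. The script pairs events by tracker name rather than by position, so it does not depend on the log lines being in chronological order across trackers.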