Hadoop Common / HADOOP-1475

local filecache disappears

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels:
      None

      Description

      All our jobs on a 600-node cluster fail. The symptom is that the local filecache disappears.

      It might have to do with the fact that lost task trackers get re-initialized when they send a heartbeat again, and purge the local directory completely without updating the filecache.
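The suspected failure mode can be illustrated with a toy model. This is only a sketch of the hypothesis above, not Hadoop code: `ToyTracker`, `localize`, `reinitialize`, and the `clear_index` flag are hypothetical names invented for illustration. It shows how purging the local directory while keeping an in-memory cache index leaves entries that point at files that no longer exist.

```python
import os
import shutil


class ToyTracker:
    """Hypothetical stand-in for a tasktracker with a local file cache."""

    def __init__(self, local_dir):
        self.local_dir = local_dir
        self.cache_index = {}  # cache key -> local path (in-memory bookkeeping)

    def localize(self, key):
        """Return the local path for key, writing the file only on a cache miss."""
        path = self.cache_index.get(key)
        if path is None:
            safe_name = key.replace("/", "_").replace(":", "_")
            path = os.path.join(self.local_dir, "filecache", safe_name)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as handle:
                handle.write("payload for " + key)
            self.cache_index[key] = path
        return path

    def reinitialize(self, clear_index):
        """Purge the local directory; optionally also forget the cache index."""
        shutil.rmtree(self.local_dir, ignore_errors=True)
        os.makedirs(self.local_dir, exist_ok=True)
        if clear_index:
            self.cache_index.clear()
```

With `clear_index=False`, a later `localize` call returns a path from the stale index even though the file was purged, which matches the "filecache disappears" symptom. With `clear_index=True`, the next call misses and the file is written again.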

      A side issue: why are so many tasktrackers marked lost, only to resume their heartbeats later (a kind of 'bogus' lost tasktracker)? We lost tasktrackers at this rate:
      13 in the 1st hour of the job
      18 in the 2nd hour
      33 in the 3rd hour
      Then the job failed.

      For example, every tasktracker lost in the first 2 hours of the job was later logged with a 'Status from unknown Tracker' message in the jobtracker log and was reinitialized.

      I attach some jobtracker log messages showing how the heartbeats of the lost tasktrackers come in late, sometimes less than 1 minute late, sometimes up to 16 minutes. What could be the reason? Do the heartbeats get lost?

      2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070
      2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070

      2007-06-07 13:39:08,740 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_075
      2007-06-07 13:41:50,810 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_075

      2007-06-07 14:32:29,093 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_082
      2007-06-07 14:35:34,217 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_082

      2007-06-07 14:15:48,856 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_085
      2007-06-07 14:20:21,337 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_085

      2007-06-07 15:25:49,524 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_098
      2007-06-07 15:33:56,732 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_098

      2007-06-07 14:49:09,203 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_106
      2007-06-07 14:54:25,538 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_106

      2007-06-07 15:02:29,337 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_108
      2007-06-07 15:02:57,558 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_108

      2007-06-07 14:19:09,022 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_112
      2007-06-07 14:19:15,273 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_112

      2007-06-07 14:19:08,881 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_114
      2007-06-07 14:30:03,354 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_114

      2007-06-07 15:42:29,579 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_116
      2007-06-07 15:43:06,422 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_116

      2007-06-07 14:55:49,280 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_117
      2007-06-07 14:56:38,452 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_117

      2007-06-07 15:15:49,461 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_120
      2007-06-07 15:31:37,028 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_120

      2007-06-07 15:09:09,435 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_174
      2007-06-07 15:18:31,254 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_174
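The gap between each "Lost tracker" entry and the matching "Status_from_unknown_Tracker" entry can be computed with a small script. This is a minimal sketch that assumes the exact line layout of the excerpts above; `parse_events` and `return_delays` are illustrative helper names, not part of Hadoop.

```python
import re
from datetime import datetime

# Matches the two jobtracker messages quoted in the log excerpts above.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) \w+ \S+: "
    r"(Lost tracker|Status_from_unknown_Tracker :) (\S+)$"
)


def parse_events(lines):
    """Yield (timestamp, kind, tracker) for every recognized log line."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines and unrelated messages
        stamp, millis, message, tracker = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        when = when.replace(microsecond=int(millis) * 1000)
        kind = "lost" if message.startswith("Lost") else "returned"
        yield when, kind, tracker


def return_delays(lines):
    """Map tracker name to seconds between being lost and reporting again."""
    lost_at = {}
    delays = {}
    for when, kind, tracker in parse_events(lines):
        if kind == "lost":
            lost_at[tracker] = when
        elif tracker in lost_at:
            delays[tracker] = (when - lost_at.pop(tracker)).total_seconds()
    return delays


sample = [
    "2007-06-07 13:09:08,518 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker tracker_070",
    "2007-06-07 13:09:48,919 WARN org.apache.hadoop.mapred.JobTracker: Status_from_unknown_Tracker : tracker_070",
]
for name, seconds in return_delays(sample).items():
    print(f"{name} reported again {seconds:.1f} s after being declared lost")
```

Applied to the full excerpt, this gives delays ranging from a few seconds (tracker_112) to just under 16 minutes (tracker_120), consistent with the range described above.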

        Attachments

        1. dist-cache-purge.patch
          2 kB
          Owen O'Malley

            People

            • Assignee:
              omalley Owen O'Malley
              Reporter:
              ckunz Christian Kunz