Hadoop Common
  HADOOP-134

JobTracker trapped in a loop if it fails to localize a task

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels:
      None

      Description

      The symptoms:

      When I ran jobs on a big cluster, I noticed that some jobs got stuck: some map tasks never started. When I looked at the log of the task tracker responsible for those tasks, I saw the following exception:

      060413 160702 Lost connection to JobTracker [kry1040/72.30.116.100:50020]. Retrying...
      java.io.IOException: No valid local directories in property: mapred.local.dir
      at org.apache.hadoop.conf.Configuration.getFile(Configuration.java:282)
      at org.apache.hadoop.mapred.JobConf.getLocalFile(JobConf.java:127)
      at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:391)
      at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:383)
      at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:270)
      at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:336)
      at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:756)

      The exception occurs because the directory hadoop/mapred/local has the "wrong" owner, so the task tracker cannot access it.
      This left the task tracker stuck in the following loop, which treats every exception from offerService() as a lost connection, sleeps for 5 seconds, and retries:

      while (running) {
        boolean staleState = false;
        try {
          // This while-loop attempts reconnects if we get network errors
          while (running && !staleState) {
            try {
              if (offerService() == STALE_STATE) {
                staleState = true;
              }
            } catch (Exception ex) {
              LOG.log(Level.INFO, "Lost connection to JobTracker [" + jobTrackAddr + "]. Retrying...", ex);
              try {
                Thread.sleep(5000);
              } catch (InterruptedException ie) {
              }
            }
          }
        } finally {
          close();
        }
        LOG.info("Reinitializing local state");
        initialize();
      }

      Issue 1:
      Method offerService() must catch and handle the exceptions that may be thrown by the new TaskInProgress() call, and report back to the job tracker if it cannot run the task. This way, the task can be assigned to another task tracker.
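
A minimal sketch of the handling Issue 1 asks for, not the actual TaskTracker code. The names TaskStarter, FailureReporter, and launchAll are hypothetical stand-ins: TaskStarter represents the step that can throw (such as localizing a task), and FailureReporter represents whatever call tells the job tracker that the task could not run.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the call that reports a failure to the job tracker.
interface FailureReporter {
    void reportTaskFailure(String taskId, String diagnostic);
}

// Hypothetical stand-in for the step that can throw, e.g. localizing a task.
interface TaskStarter {
    void start(String taskId) throws IOException;
}

class TaskLaunchSketch {
    // Tries to start every assigned task. A failure in one task is reported
    // for that task only, so it does not escape into the reconnect loop.
    static List<String> launchAll(List<String> taskIds,
                                  TaskStarter starter,
                                  FailureReporter reporter) {
        List<String> started = new ArrayList<>();
        for (String taskId : taskIds) {
            try {
                starter.start(taskId);
                started.add(taskId);
            } catch (Exception e) {
                // Report the failure so the task can be scheduled elsewhere.
                reporter.reportTaskFailure(taskId, e.toString());
            }
        }
        return started;
    }

    public static void main(String[] args) {
        List<String> started = launchAll(
            Arrays.asList("task_1", "task_2"),
            id -> {
                if (id.equals("task_1")) {
                    throw new IOException("No valid local directories");
                }
            },
            (id, diagnostic) -> System.out.println("failed " + id + ": " + diagnostic));
        System.out.println("started: " + started);
    }
}
```

The point of the sketch is only the placement of the try/catch: it wraps a single task, so a task-specific error is no longer mistaken for a lost connection.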

      Issue 2:
      The task tracker should check whether it can access the local dir at initialization time, before taking any tasks.
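
The startup check in Issue 2 could look roughly like the sketch below. It uses only java.io.File and assumes the configured value is a comma-separated list of directories. The class and method names are hypothetical, and this is not the code from the attached patches.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

class LocalDirCheckSketch {
    // Returns the configured directories that exist (or can be created)
    // and that this process can both read and write.
    static List<File> usableDirs(String commaSeparatedDirs) {
        List<File> usable = new ArrayList<>();
        for (String name : commaSeparatedDirs.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            File dir = new File(trimmed);
            boolean present = dir.isDirectory() || dir.mkdirs();
            if (present && dir.canRead() && dir.canWrite()) {
                usable.add(dir);
            }
        }
        return usable;
    }

    // Fails fast at initialization time instead of when the first task arrives.
    static List<File> checkLocalDirs(String commaSeparatedDirs) throws IOException {
        List<File> usable = usableDirs(commaSeparatedDirs);
        if (usable.isEmpty()) {
            throw new IOException("No usable local directories in: " + commaSeparatedDirs);
        }
        return usable;
    }

    public static void main(String[] args) throws IOException {
        File tmp = Files.createTempDirectory("local-dir-check").toFile();
        System.out.println("usable: " + checkLocalDirs(tmp.getPath()));
    }
}
```

A directory owned by another user, as in this report, would typically fail the canRead() or canWrite() test, so the problem would surface before any task is accepted.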

      Runping

      1. task-startup-safety.patch
        7 kB
        Owen O'Malley
      2. task-startup-safety-2.patch
        8 kB
        Owen O'Malley

          Activity

          Doug Cutting added a comment -

          I just committed this. Thanks, Owen!

          Owen O'Malley added a comment -

          This is the previous patch updated to reflect the Path changes that went in today.

          Owen O'Malley added a comment -

          Ok, this patch fixes the problem. In particular,
          1. adds the hostname to the task tracker names.
          2. moves exception-raising code out of the constructor for TaskInProgress
          3. pulls the task start up code into a separate procedure so that I can make sure it doesn't throw any exceptions
          4. records failures during initialization of tasks
          5. moves the stringifyException into the utils package.
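
As an illustration of points 2 to 5 only, and not the contents of the patch, the sketch below keeps the constructor free of work that can throw, does the risky startup in a separate method that catches everything, and records the failure as text. The class, the Localizer interface, and the State enum are hypothetical. The stringifyException helper here is a plain java.io implementation written for this example.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

class TaskStartupSketch {
    enum State { UNASSIGNED, RUNNING, FAILED }

    // Hypothetical stand-in for the localization step that may throw.
    interface Localizer {
        void localize(String taskId) throws IOException;
    }

    // Renders a throwable, including its stack trace, as a string.
    static String stringifyException(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        t.printStackTrace(out);
        out.flush();
        return buffer.toString();
    }

    private final String taskId;
    private State state = State.UNASSIGNED;
    private String diagnostic = "";

    // The constructor only stores fields, so it cannot fail on I/O.
    TaskStartupSketch(String taskId) {
        this.taskId = taskId;
    }

    // Startup is a separate step that never throws: any failure is
    // recorded on the task instead of propagating to the caller.
    void launch(Localizer localizer) {
        try {
            localizer.localize(taskId);
            state = State.RUNNING;
        } catch (Throwable t) {
            state = State.FAILED;
            diagnostic = stringifyException(t);
        }
    }

    State getState() { return state; }

    String getDiagnostic() { return diagnostic; }

    public static void main(String[] args) {
        TaskStartupSketch task = new TaskStartupSketch("task_1");
        task.launch(id -> { throw new IOException("No valid local directories"); });
        System.out.println(task.getState());
        System.out.println(task.getDiagnostic());
    }
}
```

Catching Throwable is a deliberate choice in this sketch so that launch() cannot throw; a real implementation might prefer to let fatal errors such as OutOfMemoryError propagate.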


            People

            • Assignee:
              Owen O'Malley
              Reporter:
              Runping Qi
