Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version: 0.1.0
Description
The symptoms:
When I ran jobs on a big cluster, I noticed that some jobs got stuck: some map tasks never started. When I looked at the log of the task tracker responsible for those tasks, I saw the following exception:
060413 160702 Lost connection to JobTracker [kry1040/72.30.116.100:50020]. Retrying...
java.io.IOException: No valid local directories in property: mapred.local.dir
at org.apache.hadoop.conf.Configuration.getFile(Configuration.java:282)
at org.apache.hadoop.mapred.JobConf.getLocalFile(JobConf.java:127)
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:391)
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:383)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:270)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:336)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:756)
The reason for the exception is that the directory hadoop/mapred/local has the "wrong" owner, so the task tracker cannot access it.
This left the task tracker stuck in the following loop:
    while (running) {
        boolean staleState = false;
        try {
            // This while-loop attempts reconnects if we get network errors
            while (running && !staleState) {
                try {
                    if (offerService() == STALE_STATE) {
                        staleState = true;
                    }
                } catch (Exception ex) {
                    // Any exception, not just a network error, lands here and is retried.
                    LOG.log(Level.INFO, "Lost connection to JobTracker [" + jobTrackAddr + "]. Retrying...", ex);
                    try {
                        // (body missing in the report; presumably a sleep before retrying)
                    } catch (InterruptedException ie) {
                    }
                }
            }
        } finally {
            // (body missing in the report)
        }
        LOG.info("Reinitializing local state");
        initialize();
    }
Issue 1:
The offerService() method must catch and handle any exception thrown by the new TaskInProgress() call, and report back to the job tracker when it cannot run the task. That way the task can be assigned to another task tracker instead of being retried forever on a node that cannot run it.
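A minimal sketch of the handling proposed in Issue 1, under stated assumptions: TaskStartSketch, TaskLocalizer, FailureReporter and tryStartTask are hypothetical names invented for illustration and are not part of Hadoop. TaskLocalizer stands in for the work done by new TaskInProgress(), and FailureReporter stands in for whatever mechanism the task tracker uses to report task status to the job tracker.

```java
import java.io.IOException;

/** Hypothetical sketch; none of these names are Hadoop's real API. */
class TaskStartSketch {

    /** Stand-in for the channel that reports task status to the job tracker. */
    interface FailureReporter {
        void reportFailed(String taskId, String reason);
    }

    /** Stand-in for the setup done by new TaskInProgress(...), which may throw. */
    interface TaskLocalizer {
        void localize(String taskId) throws IOException;
    }

    /**
     * Tries to start one task. A setup failure is reported for that task only
     * and is not propagated to the caller's reconnect loop.
     *
     * @return true if the task was set up, false if it was reported as failed
     */
    static boolean tryStartTask(String taskId,
                                TaskLocalizer localizer,
                                FailureReporter reporter) {
        try {
            localizer.localize(taskId);
            return true;
        } catch (IOException | RuntimeException e) {
            // Report the failure so the task can be rescheduled elsewhere.
            reporter.reportFailed(taskId, e.getMessage());
            return false;
        }
    }
}
```

With this shape, a failure to set up a task is no longer logged as "Lost connection to JobTracker", and the outer loop only retries on errors that actually come from the connection.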
Issue 2:
The TaskTracker should check at initialization time, before accepting any tasks, that it can access its local directories (mapred.local.dir).
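The startup check proposed in Issue 2 could look roughly like the sketch below. LocalDirCheckSketch, usableDirs and checkLocalDirs are hypothetical names, and the list of candidate directories is assumed to have been parsed already from mapred.local.dir. Only standard java.io.File calls are used.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a fail-fast local directory check. */
class LocalDirCheckSketch {

    /**
     * Returns the candidates that exist (or can be created) and that the
     * current process can both read and write.
     */
    static List<File> usableDirs(List<File> candidates) {
        List<File> usable = new ArrayList<>();
        for (File dir : candidates) {
            boolean present = dir.isDirectory() || dir.mkdirs();
            if (present && dir.canRead() && dir.canWrite()) {
                usable.add(dir);
            }
        }
        return usable;
    }

    /** Throws at startup if no configured local directory is usable. */
    static void checkLocalDirs(List<File> candidates) throws IOException {
        if (usableDirs(candidates).isEmpty()) {
            throw new IOException("No usable local directories among: " + candidates);
        }
    }
}
```

Calling checkLocalDirs once during initialization would surface a wrongly owned directory as a startup error rather than as a per-task failure later on.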
Runping
Attachments
Issue Links
- relates to HADOOP-137: Different TaskTrackers may get the same task tracker id, thus cause many problems. (Closed)