
Key: HADOOP-2393
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Critical
Assignee: Amareshwari Sriramadasu
Reporter: Joydeep Sen Sarma
Votes: 0
Watchers: 1
Hadoop Common

TaskTracker locks up removing job files within a synchronized method

Created: 10/Dec/07 05:58 PM   Updated: 08/Jul/09 04:52 PM
Component/s: None
Affects Version/s: 0.14.4
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
All attachments are text files licensed for inclusion in ASF works.
patch-2393.txt   2008-06-06 06:19 AM   Amareshwari Sriramadasu   6 kB
patch-2393.txt   2008-06-05 11:17 AM   Amareshwari Sriramadasu   6 kB
patch-2393.txt   2008-06-05 07:48 AM   Amareshwari Sriramadasu   5 kB
patch-2393.txt   2008-06-02 10:09 AM   Amareshwari Sriramadasu   4 kB
Environment:
0.13.1, quad-core x86-64, FC-linux. -xmx2048
ipc.client.timeout = 10000

Hadoop Flags: Reviewed
Resolution Date: 06/Jun/08 10:48 AM


Description:
We have some bad jobs whose reduces are stalling (for an unknown reason). The TaskTracker kills these processes from time to time.

Every time one of these events happens, other (healthy) map tasks on the same node are also killed. Looking at the logs and the code up to 0.14.3, it seems that the child tasks' pings to the TaskTracker time out and the child tasks self-terminate.

tasktracker log:

// notice the gap of more than 10 seconds (about 33 seconds here) in the logs of an otherwise busy node:
2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.
2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.
2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
24 active threads

... huge stack trace dump in logfile ...

Something was going on at this time that caused the TaskTracker to essentially stall. All the pings are discarded. After the stack trace dump:

2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
discarded for being too old (21380)
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
discarded for being too old (21380)
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
discarded for being too old (10367)
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
discarded for being too old (10360)
2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error

Looking at the code, a failure of the client to ping causes termination:

else {
  // send ping
  taskFound = umbilical.ping(taskId);
}
...
catch (Throwable t) {
  LOG.info("Communication exception: " + StringUtils.stringifyException(t));
  remainingRetries -= 1;
  if (remainingRetries == 0) {
    ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);
    LOG.warn("Last retry, killing " + taskId);
    System.exit(65);
  }
}

The exit code is 65, as reported by the TaskTracker.
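
The retry countdown quoted above can be tried in isolation. The following is only a sketch: the `Umbilical` interface, the method name, and the choice to reset the retry budget after a successful ping are assumptions made for illustration and are not taken from the Hadoop source. Instead of calling System.exit(65), it returns the index of the ping that exhausts the retries.

```java
public class PingRetrySketch {
    /** Hypothetical stand-in for the child's umbilical connection to the TaskTracker. */
    interface Umbilical {
        boolean ping(String taskId) throws Exception;
    }

    /**
     * Sends up to totalPings pings. Returns the 1-based index of the ping on which
     * the retry budget hits zero (where the quoted code calls System.exit(65)),
     * or -1 if the budget is never exhausted.
     */
    static int pingUntilExit(Umbilical umbilical, String taskId,
                             int maxRetries, int totalPings) {
        int remainingRetries = maxRetries;
        for (int i = 1; i <= totalPings; i++) {
            try {
                umbilical.ping(taskId);
                // Assumption: a successful ping restores the full retry budget.
                remainingRetries = maxRetries;
            } catch (Exception e) {
                remainingRetries -= 1;
                if (remainingRetries == 0) {
                    return i; // the child would terminate itself here
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Every ping fails, as when the server discards calls for being too old.
        Umbilical stalled = id -> { throw new java.io.IOException("call timed out"); };
        System.out.println(pingUntilExit(stalled, "task_example", 3, 10));
    }
}
```

With a small retry budget, a TaskTracker stall that outlasts a few consecutive ping timeouts is enough to make an otherwise healthy child exit.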

I don't see an option to turn off the stack trace dump (which could be a likely cause), and I would hate to bump up the timeout because of this. Crap.
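
The issue title attributes the lock-up to job files being removed inside a synchronized method. The sketch below illustrates that general pattern and is not TaskTracker code: the class, its methods, and the use of Thread.sleep as a stand-in for slow file deletion are all hypothetical. It measures how long a synchronized ping() waits while another thread does slow work, first while holding the same monitor and then after releasing it.

```java
import java.util.concurrent.CountDownLatch;

public class LockStallSketch {
    private final CountDownLatch deleteStarted = new CountDownLatch(1);

    // Hypothetical stand-in for slow file removal; sleeps instead of touching disk.
    private void slowDelete(long millis) {
        deleteStarted.countDown();
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stand-in for an RPC handler that needs the object's monitor.
    synchronized boolean ping() {
        return true;
    }

    // Slow work runs while the monitor is held, so ping() must wait.
    synchronized void purgeHoldingLock(long millis) {
        slowDelete(millis);
    }

    // Only bookkeeping would run under the monitor; slow work runs after release.
    void purgeOutsideLock(long millis) {
        synchronized (this) {
            // update in-memory state here
        }
        slowDelete(millis);
    }

    /** Returns how long ping() took, in milliseconds, while a purge was in progress. */
    static long pingLatencyMillis(boolean holdLock, long deleteMillis)
            throws InterruptedException {
        LockStallSketch tracker = new LockStallSketch();
        Thread purger = new Thread(() -> {
            if (holdLock) {
                tracker.purgeHoldingLock(deleteMillis);
            } else {
                tracker.purgeOutsideLock(deleteMillis);
            }
        });
        purger.start();
        tracker.deleteStarted.await(); // wait until the slow work is under way
        long begin = System.nanoTime();
        tracker.ping();
        long elapsedMillis = (System.nanoTime() - begin) / 1_000_000;
        purger.join();
        return elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("ping latency, delete under lock: " + pingLatencyMillis(true, 300) + " ms");
        System.out.println("ping latency, delete outside lock: " + pingLatencyMillis(false, 300) + " ms");
    }
}
```

In the first case the ping is delayed for roughly the duration of the slow work; in the second it returns almost immediately. Moving slow I/O out of the locked region is one common remedy for this kind of stall, though whether the attached patch takes that approach is not stated here.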


