[HADOOP-1874] lost task trackers -- jobs hang - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.15.0
Fix Version/s: 0.15.0
Component/s: None
Labels: None

Description

This happens on a 1400-node cluster using a recent nightly build patched with HADOOP-1763 (which fixes a previous 'lost task tracker' issue), running a c++-pipes job with 4200 maps and 2800 reduces. Task trackers start to get lost in large numbers toward the end of the job.

Similar non-pipes jobs do not show the same problem, but it is unclear whether the problem is related to c++-pipes. It could also be DFS overload when reduce tasks close and validate all newly created DFS files. I see DFS client RPC timeout exceptions, but these alone do not explain the escalation in lost task trackers.

I also noticed that the job tracker becomes rather unresponsive, with RPC timeout and call queue overflow exceptions. The job tracker is running with 60 handlers.

Attachments

- 1874.new.patch, 10/Oct/07 08:00, 29 kB, Devaraj Das
- 1874.new.patch, 09/Oct/07 04:54, 29 kB, Devaraj Das
- 1874.patch, 03/Oct/07 12:35, 25 kB, Devaraj Das
- lazy-dfs-ops.1.patch, 13/Sep/07 23:44, 18 kB, Devaraj Das
- lazy-dfs-ops.2.patch, 15/Sep/07 01:34, 21 kB, Devaraj Das
- lazy-dfs-ops.4.patch, 18/Sep/07 18:55, 18 kB, Devaraj Das
- lazy-dfs-ops.patch, 13/Sep/07 18:36, 15 kB, Devaraj Das
- server-throttle-hack.patch, 14/Sep/07 19:26, 1 kB, Raghu Angadi

Issue Links

relates to: HADOOP-1942 Increase the concurrency of transaction logging to edits log (Closed)

People

Assignee: Devaraj Das

Reporter: Christian Kunz

Votes: 0

Watchers: 1

Dates

Created: 11/Sep/07 05:27

Updated: 08/Jul/09 16:52

Resolved: 10/Oct/07 09:34