Description
I recently had a few nodes go bad: they were unreachable over SSH, but their Java processes were still running.
Tasks that executed on them kept failing, causing whole jobs to fail.
Because of the SSH problem I couldn't stop the Java processes, so I was helpless until I could actually power down these nodes.
Restarting the cluster doesn't help, even after removing the bad nodes from the slaves file - they simply reconnect and are accepted.
While we plan to prevent tasks from repeatedly launching on the same bad nodes, what I'd like is a way to prevent rogue processes from connecting to the masters.
Ideally, the slaves file would contain an 'exclude' section listing nodes that shouldn't be contacted and whose connection attempts should be ignored. That would also simplify configuring the slaves file for a large cluster: I'd list the full range of machines in the cluster, then list the ones that are currently down in the 'exclude' section (see the sketch below).
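As a rough illustration, such a slaves file might look like the following - the 'exclude:' keyword and layout are purely hypothetical, sketched here to show the idea, not an existing Hadoop format:

    # conf/slaves (hypothetical format with an 'exclude' section)
    node01
    node02
    node03
    node04

    exclude:
    node03    # unreachable over SSH; masters should ignore its connection attempts

With something like this, the startup scripts would skip every host listed under 'exclude:', and the masters would reject registration attempts coming from those hosts even if their old processes are still running.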
Issue Links
- relates to HDFS-134: premature end-of-decommission of datanodes (Resolved)