[HDFS-4754] Add an API in the namenode to mark a datanode as stale

Details

Type: Improvement
Status: Patch Available
Priority: Critical
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: hdfs-client, namenode
Labels:
- BB2015-05-TBR

Description

HDFS has detected stale datanodes since ~~HDFS-3703~~, using a timeout that defaults to 30s.

There are two reasons to add an API to mark a node as stale even if the timeout is not yet reached:
1) ZooKeeper can detect that a client is dead at any moment, so HBase sometimes starts recovery before the node is marked stale (even with reasonable settings such as stale: 20s; HBase ZK timeout: 30s).
2) Some third parties could detect that a node is dead before the timeout, hence saving us the cost of retrying. An example of such hardware is Arista, presented here by tsuna http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in ~~HBASE-6290~~.

As usual, even if the node is dead, it can come back before the 10-minute limit. So I would propose to set a time bound. The API would be:

namenode.markStale(String ipAddress, int port, long durationInMs);

After durationInMs, the namenode would again rely only on its heartbeat to decide.
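
To make the proposed semantics concrete, here is a minimal sketch of a time-bounded stale mark. It is not the code from the attached patches: the class name StaleMarks, the isStale method, and the injected clock are all hypothetical, chosen only for illustration. It assumes the namenode already tracks the last heartbeat time of each datanode and compares it against a stale interval (30s by default, per the description above).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a time-bounded stale mark; not the actual namenode code. */
final class StaleMarks {
    // Maps "ip:port" to the time (in ms) at which the explicit mark expires.
    private final Map<String, Long> expiryByNode = new ConcurrentHashMap<>();
    private final long staleIntervalMs;
    private final LongSupplier clockMs; // injected so the behaviour can be tested

    StaleMarks(long staleIntervalMs, LongSupplier clockMs) {
        this.staleIntervalMs = staleIntervalMs;
        this.clockMs = clockMs;
    }

    /** Same shape as the proposed markStale(ipAddress, port, durationInMs). */
    void markStale(String ipAddress, int port, long durationInMs) {
        if (durationInMs <= 0) {
            throw new IllegalArgumentException("durationInMs must be positive");
        }
        expiryByNode.put(key(ipAddress, port), clockMs.getAsLong() + durationInMs);
    }

    /**
     * A node is stale if an explicit mark is still active, or if its last
     * heartbeat is older than the stale interval. Once the mark has expired,
     * only the heartbeat rule applies.
     */
    boolean isStale(String ipAddress, int port, long lastHeartbeatMs) {
        long now = clockMs.getAsLong();
        String k = key(ipAddress, port);
        Long expiry = expiryByNode.get(k);
        if (expiry != null) {
            if (now < expiry) {
                return true;
            }
            // The mark has expired: drop it, unless it was renewed concurrently.
            expiryByNode.remove(k, expiry);
        }
        return now - lastHeartbeatMs > staleIntervalMs;
    }

    private static String key(String ipAddress, int port) {
        return ipAddress + ":" + port;
    }
}
```

In this sketch, a caller such as HBase would invoke markStale("10.0.0.1", 50010, 20000) as soon as ZooKeeper reports the node dead. A node that keeps heartbeating would be treated as healthy again 20s later without any explicit "unmark" call.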

Thoughts?

If there are no objections, and if nobody in the HDFS dev team has time to work on it, I will give it a try for branch 2 & 3.

Attachments

- 4754.v1.patch, 22/May/13 19:18, 88 kB, Nicolas Liochon
- 4754.v2.patch, 28/May/13 14:41, 91 kB, Nicolas Liochon
- 4754.v4.patch, 07/Aug/13 08:32, 26 kB, Nicolas Liochon
- 4754.v4.patch, 06/Aug/13 18:39, 26 kB, Nicolas Liochon

Issue Links

is related to

- HDFS-3706 Add the possibility to mark a node as 'low priority' for writes in the DFSClient (Open)
- HDFS-3705 Add the possibility to mark a node as 'low priority' for read in the DFSClient (Resolved)
- HDFS-4721 Speed up lease/block recovery when DN fails and a block goes into recovery (Closed)

is required by

- HBASE-5843 Improve HBase MTTR - Mean Time To Recover (Closed)

People

Assignee: Nicolas Liochon

Reporter: Nicolas Liochon

Votes: 0

Watchers: 13

Dates

Created: 25/Apr/13 13:13

Updated: 13/Jun/16 00:33