Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version: 2.0.3-alpha
- Labels: None
- Hadoop Flags: Reviewed
Description
This was observed while doing HBase WAL recovery. HBase uses append to write to its write-ahead log, so initially the pipeline is set up as
DN1 --> DN2 --> DN3
This WAL needs to be read when DN1 fails, since DN1 also hosts the HBase regionserver that owns the WAL.
HBase first recovers the lease on the WAL file. During recovery, we choose DN1 as the primary DN to perform the recovery, even though DN1 has failed and is no longer heartbeating.
Avoiding the stale DN1 would speed up recovery and reduce HBase MTTR. There are two options:
a) Ride on HDFS-3703: if stale-node detection is turned on, do not choose stale datanodes (typically those that have not heartbeated for 20-30 seconds) as primary DN(s)
b) Sort the replicas in order of last heartbeat and always pick the one that gave the most recent heartbeat
Going to the dead datanode increases lease + block recovery time, since the block goes into the UNDER_RECOVERY state even though no one is actively recovering it. Please let me know if this makes sense and, if so, whether we should move forward with (a) or (b).
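Option (b) above could be sketched roughly as follows. Note this is an illustration only, not the actual HDFS implementation: `DatanodeInfo` here is a simplified stand-in for the real descriptor class, and `choosePrimary` is a hypothetical helper.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PrimaryDatanodeChooser {

    // Minimal stand-in for a datanode descriptor with a last-heartbeat time.
    // The real HDFS DatanodeInfo carries much more state; this is for sketching only.
    static class DatanodeInfo {
        final String name;
        final long lastHeartbeatMillis;

        DatanodeInfo(String name, long lastHeartbeatMillis) {
            this.name = name;
            this.lastHeartbeatMillis = lastHeartbeatMillis;
        }
    }

    // Option (b): choose the replica whose datanode reported the most recent
    // heartbeat as the primary DN for lease/block recovery.
    static Optional<DatanodeInfo> choosePrimary(List<DatanodeInfo> replicas) {
        return replicas.stream()
                .max(Comparator.comparingLong(d -> d.lastHeartbeatMillis));
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // DN1 has failed, so its last heartbeat is stale (60 s ago).
        List<DatanodeInfo> pipeline = List.of(
                new DatanodeInfo("DN1", now - 60_000),
                new DatanodeInfo("DN2", now - 2_000),
                new DatanodeInfo("DN3", now - 1_000));
        // Picks DN3, the most recently heartbeating node, instead of the dead DN1.
        System.out.println(choosePrimary(pipeline).get().name);
    }
}
```

Sorting by last heartbeat naturally subsumes option (a) as well: any node past the staleness threshold (20-30 s without a heartbeat) sorts to the bottom and is never picked while a fresher replica exists.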
Thanks
Attachments
Issue Links
- is related to:
  - HDFS-4796 Port HDFS-4721 'Speed up lease/block recovery when DN fails and a block goes into recovery' to branch 1 (Resolved)
- is required by:
  - HBASE-5843 Improve HBase MTTR - Mean Time To Recover (Closed)
- relates to:
  - HBASE-8389 HBASE-8354 forces Namenode into loop with lease recovery requests (Closed)
  - HDFS-4724 Provide API for checking whether lease is recovered or not (Resolved)
  - HDFS-4754 Add an API in the namenode to mark a datanode as stale (Patch Available)