Description
In HDFS-13157, we discovered a series of problems with the current decommission monitor implementation, such as:
- Blocks are replicated sequentially disk by disk and node by node, so the replication load is not spread well across the cluster.
- Adding a node for decommission can cause the namenode write lock to be held for a long time.
- Decommissioning nodes floods the replication queue, so under-replicated blocks from a future node or disk failure may wait a long time before they are replicated.
- Blocks pending replication are checked many times under the write lock before they are sufficiently replicated, wasting resources.
In this Jira I propose to create a new implementation of the decommission monitor that resolves these issues. As it will be difficult to prove one implementation is better than another, the new implementation will be switchable via configuration, so either the existing monitor or the new one can be used.
I will attach a pdf with some more details on the design and then a version 1 patch shortly.
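For illustration, the switch could simply be a namenode configuration property that selects the monitor implementation class. The property and class names in the sketch below are assumptions for this proposal, not final names:

<!-- hdfs-site.xml sketch: property and class names are placeholders for illustration only -->
<property>
  <name>dfs.namenode.decommission.monitor.class</name>
  <!-- The existing monitor would remain the default; setting this to the new
       implementation class (a hypothetical DatanodeAdminBackoffMonitor here)
       would opt a cluster in to the new behaviour. -->
  <value>org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor</value>
</property>

Keeping the choice behind a single property would let a cluster fall back to the existing monitor without a code change if the new implementation misbehaves.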
Attachments
Issue Links
- causes
  - HDFS-15095 Fix accidental comment in flaky test TestDecommissioningStatus (Resolved)
- duplicates
  - HDFS-17538 Add tranfer priority queue for decommissioning datanode (Resolved)
- is related to
  - HDFS-15047 Document the new decommission monitor (HDFS-14854) (Resolved)