[HDFS-15761] Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- pull-request-available

Description

To decommission a dead DN, the complete state sequence should be:
Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED

Current logic:

If a DN is already dead when DECOMMISSIONING starts, it becomes DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.

This logic was introduced by HDFS-7374, which was itself made in response to HDFS-6791.

HDFS-6791 keeps a node in the DECOMMISSION_INPROGRESS state if it dies during decommissioning. As a side effect, a dead DN could stay in DECOMMISSION_INPROGRESS forever if it never comes back alive.

However, moving a dead DN directly to DECOMMISSIONED is not safe. For example, suppose the 3 DNs holding the replicas of the same block are all dead at the same time, and the administrator then inadvertently decommissions them. The NameNode should check replication first before transitioning them to DECOMMISSIONED; otherwise, this could result in data loss.

In this case, none of the 3 DNs can become DECOMMISSIONED, which is by design. The administrator needs to intervene manually, either repairing the dead machine or service or recovering the data, before taking action on them.

This change adds the Dead, DECOMMISSION_INPROGRESS state back:
1. A dead NORMAL DN first enters DECOMMISSION_INPROGRESS.
2. The NN checks that pendingReplicationBlocksCount and underReplicatedBlocksCount are both 0.
3. The NN transitions the dead DN to DECOMMISSIONED.

Step 2 is implemented by HDFS-7409, which adds a check that allows dead nodes in DECOMMISSION_INPROGRESS to progress to the DECOMMISSIONED state if all files on the filesystem are fully replicated.
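
The proposed flow can be sketched as a small state machine. This is an illustrative sketch only, not the code in the pull request: the class DecommissionSketch, the Node type, and the methods startDecommission and checkDeadNode are hypothetical stand-ins for the NameNode's internal bookkeeping. Only the state names and the two counter names are taken from the description above.

```java
// Hypothetical sketch of the proposed transitions; not the actual HDFS patch.
final class DecommissionSketch {

    /** Admin states named in this issue. */
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    /** Hypothetical stand-in for a DataNode descriptor. */
    static final class Node {
        final boolean alive;
        AdminState state = AdminState.NORMAL;

        Node(boolean alive) {
            this.alive = alive;
        }
    }

    /**
     * Step 1: starting decommission always enters DECOMMISSION_INPROGRESS,
     * even when the node is already dead.
     */
    static void startDecommission(Node node) {
        if (node.state == AdminState.NORMAL) {
            node.state = AdminState.DECOMMISSION_INPROGRESS;
        }
    }

    /**
     * Steps 2 and 3: a dead node that is in progress moves to DECOMMISSIONED
     * only when both counters are 0. Otherwise it stays where it is.
     */
    static void checkDeadNode(Node node,
                              long pendingReplicationBlocksCount,
                              long underReplicatedBlocksCount) {
        boolean fullyReplicated = pendingReplicationBlocksCount == 0
                && underReplicatedBlocksCount == 0;
        if (!node.alive
                && node.state == AdminState.DECOMMISSION_INPROGRESS
                && fullyReplicated) {
            node.state = AdminState.DECOMMISSIONED;
        }
    }
}
```

In this sketch, a dead node whose blocks are still under-replicated remains in DECOMMISSION_INPROGRESS after checkDeadNode, which corresponds to the case where the administrator has to intervene manually.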

Issue Links

is caused by

HDFS-7374 Allow decommissioning of dead DataNodes

Closed

relates to

HDFS-6791 A block could remain under replicated if all of its replicas are on decommissioned nodes

Closed

HDFS-7725 Incorrect "nodes in service" metrics caused all writes to fail

Closed

requires

HDFS-7409 Allow dead nodes to finish decommissioning if all files are fully replicated

Closed

links to

GitHub Pull Request #2588

Activity

Ye Ni added a comment - 04/Jan/21 19:43 - edited

cc mingma, andrew.wang, zhz, elgoiri

Íñigo Goiri added a comment - 05/Jan/21 17:42

Do we have a test that makes sure that we go to DECOMMISSION_INPROGRESS and then into DECOMMISSIONED?

Ye Ni added a comment - 05/Jan/21 19:09

elgoiri Yes, see TestDecommissioningStatus.java, lines 434, 443, 496 and 497.

People

Assignee: Ye Ni

Reporter: Ye Ni

Votes: 0

Watchers: 4

Dates

Created: 04/Jan/21 19:29

Updated: 05/Feb/21 09:46

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

40m

Hadoop HDFS