Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
Description
In some cases, such as when there is a storage problem, a DataNode's capacity and block count become zero.
When we tried to decommission those DataNodes, we ran into an issue where the decommission never completed because the NameNode had not received their first block reports.
INFO blockmanagement.DatanodeAdminManager (DatanodeAdminManager.java:startDecommission(183)) - Starting decommission of 127.0.0.1:58343 [DISK]DS-a29de094-2b19-4834-8318-76cda3bd86bf:NORMAL:127.0.0.1:58343 with 0 blocks
INFO blockmanagement.BlockManager (BlockManager.java:isNodeHealthyForDecommissionOrMaintenance(4587)) - Node 127.0.0.1:58343 hasn't sent its first block report.
INFO blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:check(258)) - Node 127.0.0.1:58343 isn't healthy. It needs to replicate 0 more blocks. Decommission In Progress is still in progress.
To make matters worse, even after we stopped these DataNodes, they remained in a dead & decommissioning state until the NameNode was restarted.
I think those DataNodes should be decommissioned immediately even if the NameNode hasn't received their first block report.
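For context, the log above comes from BlockManager#isNodeHealthyForDecommissionOrMaintenance and DatanodeAdminDefaultMonitor#check. Below is a minimal, self-contained sketch of the gate they appear to apply, based only on the log lines; it is not the actual NameNode code, and the class, field, and method names are illustrative assumptions.

// Sketch of the health gate reflected in the log: a node that has never sent a
// block report is treated as "not healthy", so the decommission monitor keeps
// it in DECOMMISSION_IN_PROGRESS on every pass, even with 0 blocks to replicate.
final class DecommissionHealthCheckSketch {

  // Illustrative stand-in for the per-DataNode state the NameNode tracks.
  static final class NodeState {
    final String address;
    final boolean firstBlockReportReceived; // assumed flag
    final int blocksPendingReplication;     // assumed counter

    NodeState(String address, boolean firstBlockReportReceived,
              int blocksPendingReplication) {
      this.address = address;
      this.firstBlockReportReceived = firstBlockReportReceived;
      this.blocksPendingReplication = blocksPendingReplication;
    }
  }

  // Mirrors the gate hit in the log: no first block report => not healthy.
  static boolean isHealthyForDecommission(NodeState node) {
    if (!node.firstBlockReportReceived) {
      System.out.printf("Node %s hasn't sent its first block report.%n",
          node.address);
      return false;
    }
    return true;
  }

  // One pass of the monitor: a node with zero blocks but no report stays stuck.
  static void checkOnce(NodeState node) {
    if (node.blocksPendingReplication == 0 && isHealthyForDecommission(node)) {
      System.out.printf("Node %s can be marked DECOMMISSIONED.%n", node.address);
    } else {
      System.out.printf(
          "Node %s isn't healthy. It needs to replicate %d more blocks."
              + " Decommission In Progress is still in progress.%n",
          node.address, node.blocksPendingReplication);
    }
  }

  public static void main(String[] args) {
    // Reproduces the reported situation: zero blocks, but no block report yet.
    checkOnce(new NodeState("127.0.0.1:58343", false, 0));
  }
}

Under this assumed logic, relaxing the check for a node with zero blocks (or for a node that is already dead) would let the decommission complete immediately, which is what the description proposes.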
Attachments
Issue Links
- is superceded by: HDFS-15963 Unreleased volume references cause an infinite loop (Resolved)
- links to