[HDFS-17231] HA: Safemode should exit when resources are from low to available - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.4, 3.3.6
Fix Version/s: 3.4.0
Component/s: ha
Labels:
- pull-request-available

Target Version/s: 3.4.0
Hadoop Flags: Reviewed

Description

The NameNodeResourceMonitor automatically puts the NameNode into safe mode when it detects that resources are insufficient. When ZKFC detects insufficient resources, it triggers a failover. Consider the following scenario:

Initially, nn01 is active and nn02 is standby. Because resources in dfs.namenode.name.dir are insufficient, the NameNodeResourceMonitor detects the problem and puts nn01 into safe mode. ZKFC then triggers a failover.
At this point, nn01 is in safe mode (ON) and standby, while nn02 is in safe mode (OFF) and active.
After a period of time, the resources in nn01's dfs.namenode.name.dir recover. This causes a slight instability and triggers a failover again.
Now nn01 is in safe mode (ON) and active, while nn02 is in safe mode (OFF) and standby.
However, because nn01 is active but still in safe mode (ON), HDFS can be neither read from nor written to.
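
The sequence above can be traced with a toy state model. This is only a sketch for reasoning about the scenario, not Hadoop code: the `NameNode` class, `monitor_tick`, `failover`, and the `exit_on_recovery` flag are all hypothetical names. The flag illustrates the behavior the issue title asks for (leaving safe mode once resources are available again) and is not taken from the actual patch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NameNode:
    """Toy stand-in for a NameNode (hypothetical, not a Hadoop class)."""
    name: str
    active: bool
    safe_mode: bool = False
    resources_ok: bool = True

    def monitor_tick(self, exit_on_recovery: bool) -> None:
        """One pass of a simplified resource monitor."""
        if not self.active:
            # Per the report, the monitor thread stops on a standby node.
            return
        if not self.resources_ok:
            self.safe_mode = True
        elif exit_on_recovery and self.safe_mode:
            # Assumed behavior: leave safe mode once resources recover.
            self.safe_mode = False

    def serves_clients(self) -> bool:
        return self.active and not self.safe_mode


def failover(old_active: NameNode, new_active: NameNode) -> None:
    """Swap the active role between the two nodes."""
    old_active.active = False
    new_active.active = True


def run_scenario(exit_on_recovery: bool) -> Tuple[NameNode, NameNode]:
    nn01 = NameNode("nn01", active=True)
    nn02 = NameNode("nn02", active=False)

    # Disk runs low on nn01: the monitor enters safe mode, then failover.
    nn01.resources_ok = False
    nn01.monitor_tick(exit_on_recovery)
    failover(nn01, nn02)

    # Disk recovers, a second failover makes nn01 active again,
    # and the monitor resumes on the newly active node.
    nn01.resources_ok = True
    failover(nn02, nn01)
    nn01.monitor_tick(exit_on_recovery)
    return nn01, nn02
```

With `exit_on_recovery=False` the model ends with nn01 active and in safe mode, so `serves_clients()` is false on both nodes, which matches the outage described in the report. With `exit_on_recovery=True`, nn01 leaves safe mode after the second failover and can serve clients again.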

Reproduction

Increase dfs.namenode.resource.du.reserved.
Increase ha.health-monitor.check-interval.ms. This keeps nn01 from switching directly to standby and stopping the NameNodeResourceMonitor thread; the NameNodeResourceMonitor must first put nn01 into safe mode before nn01 switches to standby.
On the active node nn01, use the dd command to create a file large enough to push free space below the threshold, triggering a low-available-disk-space condition.
If the nn01 NameNode process does not die, nn01 ends up in safe mode (ON) and standby.
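
For the dd step, a small helper can estimate how much data has to be written so that free space drops strictly below the reserved threshold. This is a rough sketch under assumptions: `fill_blocks_needed` and `dd_command` are hypothetical helper names, the target path in the usage note is a placeholder, and the arithmetic ignores file system overhead and concurrent writers.

```python
import shlex


def fill_blocks_needed(free_bytes: int, reserved_bytes: int,
                       block_size: int = 1024 * 1024) -> int:
    """Blocks to write so that remaining free space is below reserved_bytes."""
    if free_bytes < reserved_bytes:
        return 0  # already below the threshold, nothing to write
    excess = free_bytes - reserved_bytes
    # One extra block guarantees the written size strictly exceeds the excess.
    return excess // block_size + 1


def dd_command(path: str, free_bytes: int, reserved_bytes: int,
               block_size: int = 1024 * 1024) -> str:
    """Build a dd invocation that writes the computed number of zero blocks."""
    count = fill_blocks_needed(free_bytes, reserved_bytes, block_size)
    return (f"dd if=/dev/zero of={shlex.quote(path)} "
            f"bs={block_size} count={count}")
```

On a test node, the free-space figure could come from `shutil.disk_usage(name_dir).free`, where `name_dir` is the directory configured in dfs.namenode.name.dir, and the reserved figure from the value set for dfs.namenode.resource.du.reserved. For example, `dd_command("/tmp/fill.bin", 500, 100, block_size=100)` yields `dd if=/dev/zero of=/tmp/fill.bin bs=100 count=5`.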

Attachments

企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
20/Oct/23 03:17
125 kB
kuper

Issue Links

links to

GitHub Pull Request #6207

People

Assignee: kuper

Reporter: kuper

Votes: 0

Watchers: 4

Dates

Created: 20/Oct/23 04:07

Updated: 28/Jan/24 01:21

Resolved: 25/Oct/23 03:44