Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 3.3.4, 3.3.6
- Reviewed
Description
The NameNodeResourceMonitor automatically puts the NameNode into safe mode when it detects insufficient resources. When ZKFC detects insufficient resources, it triggers a failover. Consider the following scenario:
- Initially, nn01 is active and nn02 is standby. Resources in dfs.namenode.name.dir on nn01 run low, so the NameNodeResourceMonitor detects the problem and puts nn01 into safe mode. ZKFC then triggers a failover.
- At this point, nn01 is in safe mode (ON) and standby, while nn02 is in safe mode (OFF) and active.
- After some time, the resources in nn01's dfs.namenode.name.dir recover, which causes a brief instability and triggers failover again.
- Now nn01 is in safe mode (ON) and active, while nn02 is in safe mode (OFF) and standby.
- Because the active NameNode, nn01, is still in safe mode (ON), HDFS cannot be read from or written to.
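The sequence above can be traced with a small toy model. This is only an illustrative sketch, not Hadoop code: the `NameNode` class and the functions `resource_monitor_tick`, `failover`, `hdfs_usable`, and `run_scenario` are hypothetical names, and the model assumes, as the scenario describes, that safe mode entered because of low resources is not cleared by a failover or by the resources recovering.

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    """Toy stand-in for a NameNode; not the real Hadoop class."""
    name: str
    active: bool
    safemode: bool = False
    resources_ok: bool = True


def resource_monitor_tick(nn: NameNode) -> None:
    # Assumption: the monitor only runs on the active node and only ever
    # turns safe mode ON; nothing in this model turns it OFF again.
    if nn.active and not nn.resources_ok:
        nn.safemode = True


def failover(old_active: NameNode, new_active: NameNode) -> None:
    # Swap HA roles; safe mode state is deliberately left untouched.
    old_active.active = False
    new_active.active = True


def hdfs_usable(*nodes: NameNode) -> bool:
    # Clients need an active NameNode that is not in safe mode.
    return any(nn.active and not nn.safemode for nn in nodes)


def run_scenario():
    nn01 = NameNode("nn01", active=True)
    nn02 = NameNode("nn02", active=False)
    trace = [(nn01.active, nn01.safemode, hdfs_usable(nn01, nn02))]

    nn01.resources_ok = False          # name dir runs low on space
    resource_monitor_tick(nn01)        # nn01 enters safe mode
    trace.append((nn01.active, nn01.safemode, hdfs_usable(nn01, nn02)))

    failover(nn01, nn02)               # first failover: nn02 becomes active
    trace.append((nn01.active, nn01.safemode, hdfs_usable(nn01, nn02)))

    nn01.resources_ok = True           # space recovers on nn01
    failover(nn02, nn01)               # second failover: nn01 active again
    trace.append((nn01.active, nn01.safemode, hdfs_usable(nn01, nn02)))
    return trace


if __name__ == "__main__":
    for active, safemode, usable in run_scenario():
        print(f"nn01 active={active} safemode={safemode} hdfs_usable={usable}")
```

In the final step of the trace nn01 is active with safe mode still ON, so `hdfs_usable` is false even though nn02 is healthy.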
Reproduction
- Increase dfs.namenode.resource.du.reserved so that the low-resource threshold is easier to reach.
- Increase ha.health-monitor.check-interval.ms. Otherwise ZKFC may switch nn01 directly to standby, which stops the NameNodeResourceMonitor thread, before the monitor has acted. The NameNodeResourceMonitor must put nn01 into safe mode first, and only then should nn01 switch to standby.
- On the active node nn01, use the dd command to create a file large enough to cross the threshold, which triggers the low available disk space condition.
- If the nn01 NameNode process does not die, nn01 ends up in safe mode (ON) and standby.
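For the dd step, the file size can be estimated rather than guessed. The sketch below is a hypothetical helper, not part of Hadoop: it assumes the low-resource condition fires once the available space on the volume drops below dfs.namenode.resource.du.reserved (100 MB by default), and the names `bytes_to_fill` and `dd_block_count` as well as the 1 MB safety margin are illustrative choices.

```python
import shutil

MB = 1024 * 1024
# Default value of dfs.namenode.resource.du.reserved (100 MB).
DEFAULT_RESERVED_BYTES = 100 * MB


def bytes_to_fill(available: int, reserved: int, margin: int = MB) -> int:
    """Bytes to write so that available space ends up below `reserved`.

    Returns 0 if the volume is already below the threshold.
    """
    if available < reserved:
        return 0
    return available - reserved + margin


def dd_block_count(name_dir: str, reserved: int = DEFAULT_RESERVED_BYTES) -> int:
    """Number of 1 MB blocks dd would need to write under `name_dir`."""
    # shutil.disk_usage reports the space available on the volume
    # that holds name_dir.
    free = shutil.disk_usage(name_dir).free
    needed = bytes_to_fill(free, reserved)
    return -(-needed // MB)  # ceiling division
```

With the count from `dd_block_count`, a command of the form `dd if=/dev/zero of=<file under the name dir> bs=1M count=<count>` should push the volume below the threshold; the file path is left to the tester, and the reserved value passed in should match the one configured on nn01.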
Attachments
Issue Links
- links to