Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
0.23.7, 2.1.0-beta, 3.0.0-alpha1
None
None
Reviewed
This change makes the name node keep its internal replication queues and data node state up to date while in manual safe mode, so that metrics and the UI can present current information during safe mode. The behavior during start-up safe mode is unchanged.
Description
Courtesy Karri VRK Reddy!
1. The namenode lost datanodes, causing missing blocks.
2. The namenode was put in safe mode.
3. Datanodes were restarted on the dead nodes.
4. Waited a long time for the NN UI to reflect the recovered blocks.
5. Forced the NN out of safe mode, and suddenly there were no more missing blocks.
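The steps above can be replayed against a toy model. This is only an illustrative sketch, not HDFS code: the class `ToyNameNode`, its methods, and the `update_in_manual_safe_mode` switch are hypothetical names invented here, and the model assumes the reported missing-block count is simply recomputed (or not) whenever a datanode leaves or rejoins.

```python
class ToyNameNode:
    """Hypothetical stand-in for a namenode; not the real HDFS implementation."""

    def __init__(self, block_locations, update_in_manual_safe_mode):
        # block id -> set of datanode names that hold a replica
        self.block_locations = block_locations
        self.live = set().union(*block_locations.values())
        self.update_in_manual_safe_mode = update_in_manual_safe_mode
        self.manual_safe_mode = False
        self.reported_missing = 0

    def _recount(self):
        # A block counts as missing when none of its replicas is on a live datanode.
        self.reported_missing = sum(
            1 for nodes in self.block_locations.values() if not (nodes & self.live)
        )

    def _on_membership_change(self):
        # Assumed behaviour: without the change, the count is frozen in safe mode.
        if not self.manual_safe_mode or self.update_in_manual_safe_mode:
            self._recount()

    def datanode_lost(self, name):
        self.live.discard(name)
        self._on_membership_change()

    def datanode_rejoined(self, name):
        self.live.add(name)
        self._on_membership_change()

    def enter_safe_mode(self):
        self.manual_safe_mode = True

    def leave_safe_mode(self):
        self.manual_safe_mode = False
        self._recount()


def replay(update_in_manual_safe_mode):
    """Run steps 1 to 5 and return (count seen in safe mode, count after leaving)."""
    nn = ToyNameNode({"b1": {"dn1"}, "b2": {"dn1", "dn2"}}, update_in_manual_safe_mode)
    nn.datanode_lost("dn1")        # step 1: b1 becomes missing
    nn.enter_safe_mode()           # step 2
    nn.datanode_rejoined("dn1")    # step 3
    during = nn.reported_missing   # step 4: what the UI would show
    nn.leave_safe_mode()           # step 5
    return during, nn.reported_missing


print(replay(False))  # count stays stale while in safe mode
print(replay(True))   # count follows the rejoined datanode
```

With the switch off, the model shows the stale count described in step 4; with it on, the count drops as soon as the datanode rejoins, which is the behavior the change aims for in manual safe mode.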
I was able to reproduce this on 0.23 and trunk. I set dfs.namenode.heartbeat.recheck-interval to 1 and killed the DN to simulate a "lost" datanode. The opposite case also has problems: a datanode failing while the NN is in safemode does not lead to a missing-blocks message.
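To see why lowering dfs.namenode.heartbeat.recheck-interval helps simulate a lost datanode quickly, here is a small sketch of the dead-node timeout. It assumes the timeout is computed as 2 * recheck interval + 10 * heartbeat interval, with dfs.heartbeat.interval at its default of 3 seconds and the recheck interval in milliseconds; that formula is my understanding of the namenode's behavior and should be checked against the Hadoop version in use. The function name is made up for this example.

```python
def heartbeat_expire_interval_ms(recheck_interval_ms, heartbeat_interval_s=3):
    """Assumed timeout before the namenode declares a datanode dead, in ms."""
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


# Default recheck interval (assumed to be 300000 ms, i.e. 5 minutes).
default_timeout_ms = heartbeat_expire_interval_ms(300_000)

# Recheck interval of 1 ms, as used in the reproduction above.
repro_timeout_ms = heartbeat_expire_interval_ms(1)

print(default_timeout_ms / 1000, "s with the default setting")
print(repro_timeout_ms / 1000, "s with recheck-interval set to 1")
```

Under these assumptions a killed DN is marked dead after roughly half a minute instead of more than ten minutes, which keeps the reproduction short.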
Unless the NN keeps this list of missing blocks updated while in safemode, the grid admins will not know when to take the cluster out of safemode.