Description
The current failed.volumes.tolerated behavior is not user friendly. Datanodes can be configured to tolerate N volume failures and still offer service, but if the cluster is restarted, none of the datanodes with failed volumes will start unless the failed volumes have been removed from the HDFS configuration files on the respective hosts.
The failed.volumes.tolerated configuration option should be respected on startup. The datanode should refuse to start only if more than failed.volumes.tolerated volumes (HDFS-1161) have failed, or if a configured critical volume (HDFS-1848) has failed (which is probably not an issue in practice, since datanode startup probably fails anyway if, e.g., the root volume has gone read-only).
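To make the proposed startup rule concrete, here is a minimal sketch in Java. It is not the actual DataNode code: the class `VolumeStartupCheck` and its methods are hypothetical names chosen for illustration, the usability test is a simplified stand-in for a real disk check, and the extra requirement that at least one volume still works is an assumption rather than something stated in this issue. The critical-volume condition from HDFS-1848 is not modeled.

```java
import java.io.File;
import java.util.List;

/** Hypothetical helper illustrating the proposed startup rule; not DataNode code. */
final class VolumeStartupCheck {
    private VolumeStartupCheck() {}

    /** Simplified stand-in for a disk check: the directory must exist and be fully accessible. */
    static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /** Counts the configured data directories that fail the usability test. */
    static int countFailed(List<File> configuredDirs) {
        int failed = 0;
        for (File dir : configuredDirs) {
            if (!isUsable(dir)) {
                failed++;
            }
        }
        return failed;
    }

    /**
     * Startup is allowed when the number of failed volumes does not exceed the
     * tolerated count. Assumption: at least one volume must still be usable.
     */
    static boolean canStart(int configured, int failed, int tolerated) {
        boolean withinTolerance = failed <= tolerated;
        boolean hasWorkingVolume = failed < configured;
        return withinTolerance && hasWorkingVolume;
    }
}
```

With four configured volumes and a tolerance of 1, `canStart(4, 1, 1)` allows startup while `canStart(4, 2, 1)` refuses it, which is the behavior this issue asks for in place of failing on any bad volume.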
Issue Links
- duplicates
  - HDFS-1592 Datanode startup doesn't honor volumes.tolerated (Closed)
- is part of
  - HDFS-2137 Datanode Disk Fail Inplace (Resolved)
- is related to
  - HDFS-1158 HDFS-457 increases the chances of losing blocks (Resolved)
  - HDFS-1847 Datanodes should decomission themselves on volume failure (Open)
- relates to
  - HDFS-1848 Datanodes should shutdown when a critical volume fails (Open)