Details
Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Description
Currently, NNStorageRetentionManager simply skips a storage directory if it detects a problem with it. Because checkpoint saving does not go through the same set of checks, new checkpoints can keep being written to a directory whose old files are never purged, which can lead to the disk space exhaustion seen in HDFS-11714.
Instead of ignoring errors, the retention manager should handle them properly. One potential improvement is to catch the exception and report the storage directory failure using NNStorage.reportErrorsOnDirectories(). attemptRestoreRemovedStorage() would then need extra checks, e.g. for the existence of a VERSION file.
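A rough sketch of the proposed behavior follows. It is not the actual NameNode code: RetentionSketch, Dir, purge() and the active/removed lists are hypothetical stand-ins, and the two methods named after NNStorage.reportErrorsOnDirectories() and attemptRestoreRemovedStorage() only imitate the intent described above. The location current/VERSION and the writability check are assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the proposal; none of these types are Hadoop classes.
class RetentionSketch {
    /** Stand-in for a storage directory (hypothetical). */
    static final class Dir {
        final File root;
        Dir(File root) { this.root = root; }
    }

    final List<Dir> active = new ArrayList<>();   // directories currently in use
    final List<Dir> removed = new ArrayList<>();  // directories reported as failed

    /** Imitates reporting failed directories: move them from active to removed. */
    void reportErrorsOnDirectories(List<Dir> failed) {
        for (Dir d : failed) {
            if (active.remove(d)) {
                removed.add(d);
            }
        }
    }

    /** Purge every active directory; report failures instead of silently skipping. */
    void purgeOldStorage() {
        List<Dir> failed = new ArrayList<>();
        for (Dir d : new ArrayList<>(active)) {
            try {
                purge(d);
            } catch (IOException e) {
                failed.add(d);  // previously: the directory would just be skipped
            }
        }
        if (!failed.isEmpty()) {
            reportErrorsOnDirectories(failed);
        }
    }

    /** Placeholder purge: only checks that the directory can be listed. */
    void purge(Dir d) throws IOException {
        File[] files = new File(d.root, "current").listFiles();
        if (files == null) {
            throw new IOException("Cannot list files in " + d.root);
        }
        // Real retention logic (deleting old images and edits) is omitted here.
    }

    /** Restore a removed directory only if it passes the extra checks. */
    void attemptRestoreRemovedStorage() {
        for (Dir d : new ArrayList<>(removed)) {
            // Assumed layout: the VERSION file lives under <root>/current.
            File version = new File(new File(d.root, "current"), "VERSION");
            if (d.root.canWrite() && version.isFile()) {
                removed.remove(d);
                active.add(d);
            }
        }
    }
}
```

In this sketch a directory that fails purging leaves the active list, so nothing else would treat it as healthy, and it only comes back once a VERSION file is present.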