Description
I discovered this in the course of trying to implement a fix for HDFS-1505.
Per the comment for FSImage.saveNamespace(...), the save namespace algorithm proceeds in the following order:
1. rename current to lastcheckpoint.tmp for all of them,
2. save image and recreate edits for all of them,
3. rename lastcheckpoint.tmp to previous.checkpoint.
The problem is that step 3 (the rename of lastcheckpoint.tmp to previous.checkpoint) runs even when step 2 (saving the image and recreating edits) fails for every storage directory. Upon restart, the NN will find current directories that are missing or corrupt and no lastcheckpoint.tmp directories, and so will conclude that the storage directories are not formatted.
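The ordering problem can be illustrated with a small, self-contained sketch. This is not the actual FSImage code and not necessarily the fix that was adopted: the class SaveNamespaceSketch, the ImageWriter hook, and the choice to simply skip the final rename are all assumptions made for illustration. Only the directory names come from the comment quoted above. The sketch refuses to run the final rename when the save failed in every storage directory, so lastcheckpoint.tmp is still on disk after the failure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative stand-in for the save namespace ordering; not the real FSImage. */
public class SaveNamespaceSketch {

    /** Hypothetical hook standing in for "save image and recreate edits". */
    public interface ImageWriter {
        void write(Path currentDir) throws IOException;
    }

    /** Returns the storage directories whose save succeeded. */
    public static List<Path> saveNamespace(List<Path> storageDirs, ImageWriter writer)
            throws IOException {
        // First: rename current to lastcheckpoint.tmp for all directories.
        for (Path sd : storageDirs) {
            Files.move(sd.resolve("current"), sd.resolve("lastcheckpoint.tmp"));
        }

        // Second: save the image into a fresh current directory, tracking successes.
        List<Path> saved = new ArrayList<>();
        for (Path sd : storageDirs) {
            try {
                writer.write(Files.createDirectory(sd.resolve("current")));
                saved.add(sd);
            } catch (IOException e) {
                // This directory failed; continue so the others still get a chance.
            }
        }

        // Guard (assumption): if nothing was saved, keep lastcheckpoint.tmp in place.
        if (saved.isEmpty()) {
            throw new IOException("save failed in all storage directories");
        }

        // Third: rename lastcheckpoint.tmp to previous.checkpoint, only where the save worked.
        for (Path sd : saved) {
            Path previous = sd.resolve("previous.checkpoint");
            deleteRecursively(previous);
            Files.move(sd.resolve("lastcheckpoint.tmp"), previous);
        }
        return saved;
    }

    /** Deletes a directory tree if it exists, children before parents. */
    private static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(root)) {
            deepestFirst = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }
}
```

Without the guard, the final loop would run over every directory and no lastcheckpoint.tmp would remain after a total failure, which matches the state described above.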
This issue appears to be present on both 0.22 and 0.23. This should arguably be a 0.22/0.23 blocker.
Attachments
Issue Links
- relates to HDFS-1896 Additional QA tasks for Edit Log branch (Resolved)