Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.0.4-alpha, 3.0.0-alpha1
-
None
-
Reviewed
Description
This is a rare error that can occur if the NameNode (NN) is shut down during the checkpointing process.
Checkpointing followed by a restart normally looks like this:
- FSImage is written to a tmp file and then renamed
- MD5 file is written to a tmp file and then renamed
- NN is killed and restarted
- NN scans storage directories and picks up the renamed image file
- NN validates that the image file matches its md5 file
If the NN is killed before the second step (writing and renaming the MD5 file) completes, this is what happens instead:
- FSImage is written to a tmp file and then renamed
- NN is killed and restarted (no MD5 file!)
- NN scans storage directories and picks up the renamed image file
- Since there's no matching MD5 file, NN errors out with a checksum error
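The failure window described above can be reproduced with a small simulation. This is only a sketch and not the NameNode's actual code: the function names (`save_checkpoint`, `load_latest_image`), the `ChecksumError` class, the `crash_before_md5` flag, and the `fsimage_<txid>` / `.md5` file naming are illustrative assumptions.

```python
import hashlib
import os
import tempfile


class ChecksumError(Exception):
    """Hypothetical error: image has no matching or valid MD5 file."""


def write_then_rename(path, data):
    """Write to a tmp file next to `path`, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def save_checkpoint(storage_dir, txid, data, crash_before_md5=False):
    """Image first, MD5 second (the order described in this issue)."""
    image = os.path.join(storage_dir, f"fsimage_{txid}")
    write_then_rename(image, data)            # image is now visible
    if crash_before_md5:
        return                                # simulated NN kill
    digest = hashlib.md5(data).hexdigest().encode()
    write_then_rename(image + ".md5", digest)


def load_latest_image(storage_dir):
    """Pick the newest renamed image, then validate it against its MD5 file."""
    txids = [
        int(name[len("fsimage_"):])
        for name in os.listdir(storage_dir)
        if name.startswith("fsimage_") and not name.endswith((".md5", ".tmp"))
    ]
    image = os.path.join(storage_dir, f"fsimage_{max(txids)}")
    if not os.path.exists(image + ".md5"):
        raise ChecksumError(f"no MD5 file for {image}")
    with open(image, "rb") as f:
        data = f.read()
    with open(image + ".md5", "rb") as f:
        expected = f.read().decode()
    if hashlib.md5(data).hexdigest() != expected:
        raise ChecksumError(f"MD5 mismatch for {image}")
    return data


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    save_checkpoint(d, 1, b"good image")
    load_latest_image(d)                      # loads fine
    save_checkpoint(d, 2, b"newer image", crash_before_md5=True)
    try:
        load_latest_image(d)
    except ChecksumError as e:
        print("startup failed:", e)
```

In this sketch the older checkpoint (txid 1) is still intact on disk, but the loader never considers it because the newer image without an MD5 file is picked first.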
I think we can fix this by inverting the write order (write the MD5 file before the image is renamed into place), or by inverting the read order (look for the MD5 file first, then load the matching image).
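One possible reading of the two proposed fixes, sketched as a simulation. This is not the patch that was committed for this issue; `save_checkpoint_md5_first`, `load_latest_verified_image`, the `crash_after_md5` flag, and the file naming are hypothetical names chosen for illustration.

```python
import hashlib
import os
import tempfile


def _write_tmp(path, data):
    """Write `data` to a tmp file next to `path` and return the tmp path."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return tmp


def save_checkpoint_md5_first(storage_dir, txid, data, crash_after_md5=False):
    """Writer-side variant: the MD5 file becomes visible before the image."""
    image = os.path.join(storage_dir, f"fsimage_{txid}")
    image_tmp = _write_tmp(image, data)
    digest = hashlib.md5(data).hexdigest().encode()
    os.replace(_write_tmp(image + ".md5", digest), image + ".md5")
    if crash_after_md5:
        return                      # simulated kill: image is still a .tmp file
    os.replace(image_tmp, image)


def load_latest_verified_image(storage_dir):
    """Reader-side variant: start from MD5 files, then find the image."""
    txids = sorted(
        (int(n[len("fsimage_"):-len(".md5")])
         for n in os.listdir(storage_dir)
         if n.startswith("fsimage_") and n.endswith(".md5")),
        reverse=True,
    )
    for txid in txids:
        image = os.path.join(storage_dir, f"fsimage_{txid}")
        if not os.path.exists(image):
            continue                # orphaned MD5 file from an interrupted save
        with open(image, "rb") as f:
            data = f.read()
        with open(image + ".md5", "rb") as f:
            expected = f.read().decode()
        if hashlib.md5(data).hexdigest() != expected:
            raise ValueError(f"MD5 mismatch for {image}")
        return txid, data
    return None                     # no complete checkpoint found
```

With either ordering, an interrupted checkpoint leaves behind at most an orphaned file that the loader skips, so the previous complete checkpoint is still used. A genuine mismatch between an image and its MD5 file is still treated as an error here.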
Attachments
Issue Links
- duplicates
-
HDFS-3736 Failure in starting NN due to fsimage loading failure
- Resolved