Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
Description
Since we fixed HDFS-9406, new cases of similar fsimage corruption have been reported from the field. To reproduce the corruption, we need a good fsimage plus the editlogs to replay on top of it. However, by the time the corruption is detected (usually at a later NN restart), the good fsimage has typically already been deleted.
We need a way to detect fsimage corruption on the spot. What I currently think we could do is:
1. After the SNN creates a new fsimage, it spawns a modified NN process (an NN started with some new command-line args) that just loads the fsimage and does nothing else.
2. If that process fails, the currently running SNN either a) backs up the fsimage + editlogs or b) stops checkpointing. It also needs to somehow raise a flag to the user that the fsimage is corrupt.
In step 2, if we do a), we need to introduce a new NN->JN API to back up editlogs; if we do b), it changes the SNN's behavior and is somewhat incompatible.
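The verify-then-preserve flow above could be sketched roughly as follows. This is only an illustration, not HDFS code: `verify_and_preserve`, the `verify_cmd` argument, and the `CORRUPT_FSIMAGE` marker file are hypothetical names, and `verify_cmd` stands in for the proposed load-only NN process, whose command-line arguments do not exist yet. It also assumes the editlogs are plain local files, so it sidesteps the NN->JN API question for option a).

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def verify_and_preserve(image: Path,
                        edit_logs: Sequence[Path],
                        backup_dir: Path,
                        verify_cmd: Sequence[str],
                        timeout_s: float = 3600.0) -> bool:
    """Run a load-only verifier on a new fsimage; preserve evidence on failure.

    verify_cmd is a placeholder (assumption) for the proposed "modified NN"
    process: it receives the image path as its last argument and must exit
    with status 0 only if the image loaded cleanly.
    """
    try:
        result = subprocess.run(list(verify_cmd) + [str(image)],
                                timeout=timeout_s)
        loaded = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # A verifier that hangs or cannot be started counts as a failure.
        loaded = False

    if loaded:
        return True

    # Option a): copy the suspect fsimage and its editlogs aside so the
    # corruption can be reproduced later.
    backup_dir.mkdir(parents=True, exist_ok=True)
    for path in [image, *edit_logs]:
        shutil.copy2(path, backup_dir / path.name)

    # Raise a flag for the operator (hypothetical marker file).
    (backup_dir / "CORRUPT_FSIMAGE").write_text(
        "verifier failed to load {}\n".format(image.name))
    return False
```

A caller that prefers option b) could treat a `False` return as the signal to stop checkpointing instead of, or in addition to, keeping the backup.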
Issue Links
- is related to: HDFS-13818 Extend OIV to detect FSImage corruption (Resolved)