Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.6.0
- Labels: None
- Environment: CentOS 6 amd64, CDH 5.4.4-1; 2x Intel(R) Xeon(R) CPU E5-2640 v3; 32 GB memory; NameNode with ~700,000 blocks, no HA setup
- Hadoop Flags: Reviewed
Description
FSImage corruption happened after HDFS snapshots were taken. The cluster was not in use
at that time.
When the NameNode restarted, it reported a NullPointerException:
15/11/07 10:01:15 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/fsimage_checker_5857/fsimage/current
15/11/07 10:01:15 INFO namenode.FSImage: No edit log streams selected.
15/11/07 10:01:18 INFO namenode.FSImageFormatPBINode: Loading 1370277 INodes.
15/11/07 10:01:27 ERROR namenode.NameNode: Failed to start namenode.
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addChild(INodeDirectory.java:531)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.addToParent(FSImageFormatPBINode.java:252)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:202)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:261)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
    at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:929)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:913)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:732)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:668)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:643)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
15/11/07 10:01:27 INFO util.ExitUtil: Exiting with status 1
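For reference, the damaged fsimage can be inspected offline with the Offline Image Viewer instead of by starting a NameNode; a minimal example follows, where the fsimage filename is a placeholder for the actual checkpoint file under dfs.namenode.name.dir/current:

    # Dump the protobuf fsimage to XML so its INode and INodeDirectory
    # sections can be examined for the dangling child reference that
    # triggers the NPE in INodeDirectory.addChild.
    # (fsimage_0000000000000000000 is a hypothetical filename.)
    hdfs oiv -p XML -i fsimage_0000000000000000000 -o fsimage.xml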
Corruption happened after "07.11.2015 00:15"; after that time, ~9,300 blocks were invalidated that should not have been.
After recovering the FSImage, I discovered that around 9,300 blocks were missing.
I have also attached NameNode logs from before and after the corruption occurred.
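For reference, missing blocks can be enumerated after recovery with fsck; standard options are shown below as an example:

    # List every file with missing or corrupt blocks on the recovered cluster
    hdfs fsck / -list-corruptfileblocks
    # Per-file detail, including block IDs and replica locations
    hdfs fsck / -files -blocks -locations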
Attachments
Issue Links
- is duplicated by
  - HDFS-9697 NN fails to restart due to corrupt fsimage caused by snapshot handling (Resolved)
- relates to
  - HDFS-13101 Yet another fsimage corruption related to snapshot (Resolved)
  - HDFS-9696 Garbage snapshot records lingering forever (Closed)