Hadoop HDFS / HDFS-9406

FSImage may get corrupted after deleting snapshot


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Component/s: namenode
    • Labels: None
    • Environment: CentOS 6 amd64, CDH 5.4.4-1
      2x CPU: Intel(R) Xeon(R) CPU E5-2640 v3
      Memory: 32 GB
      NameNode: ~700,000 blocks, no HA setup
    • Hadoop Flags: Reviewed

    Description

      FSImage corruption occurred after HDFS snapshots were taken. The cluster was not in
      use at the time.

      When the NameNode restarted, it reported a NullPointerException:

      15/11/07 10:01:15 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/fsimage_checker_5857/fsimage/current
      15/11/07 10:01:15 INFO namenode.FSImage: No edit log streams selected.
      15/11/07 10:01:18 INFO namenode.FSImageFormatPBINode: Loading 1370277 INodes.
      15/11/07 10:01:27 ERROR namenode.NameNode: Failed to start namenode.
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addChild(INodeDirectory.java:531)
              at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.addToParent(FSImageFormatPBINode.java:252)
              at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:202)
              at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:261)
              at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
              at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:929)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:913)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:732)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:668)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:643)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
      15/11/07 10:01:27 INFO util.ExitUtil: Exiting with status 1
      

      The corruption happened after "07.11.2015 00:15"; after that time, ~9300 blocks were
      invalidated that should not have been. After recovering the FSImage, I discovered that
      around 9300 blocks were missing.
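
      A saved fsimage can be sanity-checked offline before trusting it, for example by
      dumping it to XML with the offline image viewer. Below is a minimal sketch that calls
      the viewer programmatically (equivalent to the hdfs oiv CLI); both file paths are
      hypothetical placeholders.

      import org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB;

      public class FsImageDump {
        public static void main(String[] args) throws Exception {
          // Equivalent to: hdfs oiv -p XML -i <fsimage> -o <out.xml>
          // Both paths are hypothetical; point them at a local copy of the image.
          OfflineImageViewerPB.run(new String[] {
              "-p", "XML",
              "-i", "/tmp/fsimage_copy/fsimage_0000000000000000042",
              "-o", "/tmp/fsimage_copy/fsimage.xml"
          });
        }
      }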

      I have also attached NameNode logs from before and after the corruption occurred.
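
      For anyone studying this failure mode, here is a minimal sketch of the snapshot
      create/delete sequence followed by an fsimage save and reload, written against the
      test-only MiniDFSCluster. The exact sequence is an illustrative assumption and is not
      guaranteed to reproduce this particular corruption; it only exercises the save/restart
      path where the NPE above surfaced.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.hdfs.DistributedFileSystem;
      import org.apache.hadoop.hdfs.MiniDFSCluster;
      import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

      public class SnapshotFsImageRepro {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
          try {
            cluster.waitActive();
            DistributedFileSystem fs = cluster.getFileSystem();

            Path dir = new Path("/data");
            fs.mkdirs(dir);
            fs.allowSnapshot(dir);            // make /data snapshottable
            Path file = new Path(dir, "f0");
            fs.create(file).close();          // file that the snapshot will reference
            fs.createSnapshot(dir, "s0");
            fs.delete(file, false);           // delete the file; snapshot s0 still references it
            fs.deleteSnapshot(dir, "s0");     // drop the snapshot holding the last reference

            // Persist a new fsimage and force the NameNode to reload it.
            fs.setSafeMode(SafeModeAction.SAFEMODE_ENTER);
            fs.saveNamespace();
            fs.setSafeMode(SafeModeAction.SAFEMODE_LEAVE);
            cluster.restartNameNode();        // a corrupt image fails here, as in the trace above
          } finally {
            cluster.shutdown();
          }
        }
      }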

      Attachments

        1. HDFS-9406.001.patch
          5 kB
          Yongjun Zhang
        2. HDFS-9406.002.patch
          5 kB
          Yongjun Zhang
        3. HDFS-9406.003.patch
          8 kB
          Yongjun Zhang
        4. HDFS-9406.branch-2.7.patch
          8 kB
          Yongjun Zhang

    People

      Assignee: Yongjun Zhang (yzhangal)
      Reporter: Stanislav Antic (stanislav.antic@gmail.com)
      Votes: 0
      Watchers: 21
