Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-13596

NN restart fails after RollingUpgrade from 2.x to 3.x

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • None
    • 3.3.0, 3.2.1, 3.1.3
    • hdfs
    • None

    Description

      After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails while replaying edit logs.

      • After NN is started with rollingUpgrade, the layoutVersion written to editLogs (before finalizing the upgrade) is the pre-upgrade layout version (so as to support downgrade).
      • When writing transactions to log, NN writes as per the current layout version. In 3.x, erasureCoding bits are added to the editLog transactions.
      • So any edit log written after the upgrade and before finalizing the upgrade will have the old layout version but the new format of transactions.
      • When NN is restarted and the edit logs are replayed, the NN reads the old layout version from the editLog file. When parsing the transactions, it assumes that the transactions are also from the previous layout and hence skips parsing the erasureCoding bits.
      • This cascades into reading the wrong set of bits for other fields and leads to NN shutting down.

      Sample error output:

      java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected length 16
       at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
       at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
       at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
       at org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
       at org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
       at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
       at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
       at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
       at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
      2018-05-17 19:10:06,522 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
      java.io.IOException: java.lang.IllegalStateException: Cannot skip to less than the current value (=16389), where newValue=16388
       at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
       at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
       at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
       at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
       at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
      Caused by: java.lang.IllegalStateException: Cannot skip to less than the current value (=16389), where newValue=16388
       at org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
       at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1943)
      

      Attachments

        1. HDFS-13596.001.patch
          32 kB
          Hui Fei
        2. HDFS-13596.002.patch
          33 kB
          Hui Fei
        3. HDFS-13596.003.patch
          33 kB
          Hui Fei
        4. HDFS-13596.004.patch
          42 kB
          Hui Fei
        5. HDFS-13596.005.patch
          42 kB
          Hui Fei
        6. HDFS-13596.006.patch
          42 kB
          Hui Fei
        7. HDFS-13596.007.patch
          42 kB
          Hui Fei
        8. HDFS-13596.008.patch
          24 kB
          Hui Fei
        9. HDFS-13596.009.patch
          23 kB
          Hui Fei
        10. HDFS-13596.010.patch
          23 kB
          Hui Fei

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            ferhui Hui Fei
            hanishakoneru Hanisha Koneru
            Votes:
            4 Vote for this issue
            Watchers:
            58 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment