Hadoop HDFS / HDFS-10659

Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory


      In my environment, Namenodes crash after a majority of the Journalnodes are re-installed. We manage multiple clusters and do rolling upgrades followed by a rolling re-install of each node, including the master (NN, JN, RM, ZK) nodes. When a Journalnode is re-installed or moved to a new disk/host, instead of running the "initializeSharedEdits" command, I copy the VERSION file from one of the other Journalnodes, which allows my NN to start writing data to the newly installed Journalnode.

      To achieve quorum among the JNs and recover unfinalized segments, the NN creates NNNN.tmp files under the "<disk>/jn/current/paxos" directory during startup. In the current implementation, the "paxos" directory is only created by the "initializeSharedEdits" command. If a JN is re-installed, the "paxos" directory is not created on JN startup, nor by the NN while writing the NNNN.tmp files, which causes the NN to crash with the following error message:

      /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No such file or directory)
              at java.io.FileOutputStream.open(Native Method)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
              at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
              at org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
              at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
              at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
              at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
              at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
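
      The failure at the top of this trace can be reproduced outside Hadoop. The following is a minimal sketch, not Hadoop code: the class name MissingPaxosDirDemo and the temporary directory are made up for illustration, and only the "paxos" directory name and the 64044.tmp file name come from the report. It shows that java.io.FileOutputStream does not create missing parent directories and fails with a FileNotFoundException when the parent is absent.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class MissingPaxosDirDemo {

    /**
     * Tries to open "<currentDir>/paxos/64044.tmp" for writing without
     * creating "paxos" first. Returns true if the open fails because the
     * parent directory is missing.
     */
    public static boolean writeFailsWithoutParent(File currentDir) throws IOException {
        File tmp = new File(new File(currentDir, "paxos"), "64044.tmp");
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write(0);
            return false;
        } catch (FileNotFoundException e) {
            // The message names the path that could not be opened.
            System.out.println(e.getMessage());
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a freshly re-installed "<disk>/jn/.../current" directory.
        File current = Files.createTempDirectory("jn-current").toFile();
        System.out.println("write failed: " + writeFailsWithoutParent(current));
    }
}
```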

      The current getPaxosFile method simply returns a path to a file under the "paxos" directory without verifying that the directory exists. Since the "paxos" directory holds files that are required for NN recovery and for achieving JN quorum, my proposed solution is to add a check to the "getPaxosFile" method and create the "paxos" directory if it is missing.
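
      The proposed check could look roughly like the sketch below. This is an illustration under stated assumptions, not the attached patch: the class PaxosDirSketch, the currentDir parameter, and the method signature are hypothetical stand-ins for the real storage class, and the actual change may handle errors differently.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class PaxosDirSketch {

    /**
     * Hypothetical variant of getPaxosFile: returns the path for the given
     * segment under "<currentDir>/paxos", creating "paxos" if it is missing.
     */
    public static File getPaxosFile(File currentDir, long segmentTxId) throws IOException {
        File paxosDir = new File(currentDir, "paxos");
        // mkdirs() returns false when the directory already exists (for example
        // if another thread created it first), so re-check before failing.
        if (!paxosDir.isDirectory() && !paxosDir.mkdirs() && !paxosDir.isDirectory()) {
            throw new IOException("Could not create paxos directory: " + paxosDir);
        }
        return new File(paxosDir, String.valueOf(segmentTxId));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a re-installed JN storage directory with no "paxos" subdirectory.
        File current = Files.createTempDirectory("jn-current").toFile();
        File paxosFile = getPaxosFile(current, 64044L);
        System.out.println(paxosFile + " parent exists: " + paxosFile.getParentFile().isDirectory());
    }
}
```

      With a check of this kind, a JN whose storage was restored by copying only the VERSION file would get its "paxos" directory created on first use instead of failing during recovery.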


        1. HDFS-10659.000.patch
          52 kB
          Hanisha Koneru
        2. HDFS-10659.001.patch
          37 kB
          Hanisha Koneru
        3. HDFS-10659.002.patch
          36 kB
          Hanisha Koneru
        4. HDFS-10659.003.patch
          39 kB
          Hanisha Koneru
        5. HDFS-10659.004.patch
          6 kB
          Hanisha Koneru
        6. HDFS-10659.005.patch
          3 kB
        7. HDFS-10659.006.patch
          4 kB

            starphin star
            aanand001c Amit Anand