Hadoop HDFS / HDFS-10659

Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory


Details

    Description

      In my environment, the Namenodes crash after a majority of the Journalnodes are re-installed. We manage multiple clusters and do rolling upgrades followed by a rolling re-install of each node, including the master (NN, JN, RM, ZK) nodes. When a Journalnode is re-installed or moved to a new disk or host, instead of running the "initializeSharedEdits" command, I copy the VERSION file from one of the other Journalnodes, which allows the NN to start writing data to the newly installed Journalnode.

      To achieve JN quorum and recover unfinalized segments, the NN creates NNNN.tmp files under the "<disk>/jn/current/paxos" directory during startup. In the current implementation, the "paxos" directory is created only by the "initializeSharedEdits" command. If a JN is re-installed, the directory is not created on JN startup, nor by the NN when it writes the NNNN.tmp files, which causes the NN to crash with the following error message:

      192.168.100.16:8485: /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No such file or directory)
              at java.io.FileOutputStream.open(Native Method)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
              at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
              at org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
              at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
              at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
              at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
              at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
      

      The current getPaxosFile method simply returns a path to a file under the "paxos" directory without verifying that the directory exists. Since the "paxos" directory holds files that are required for NN recovery and for achieving JN quorum, my proposed solution is to add a check to the "getPaxosFile" method and create the "paxos" directory if it is missing.
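
      The proposed check could look roughly like the sketch below. This is not the Hadoop source or any of the attached patches: the class PaxosDirSketch, its constructor, and the getPaxosDir helper are hypothetical stand-ins, and only the idea (create the "paxos" directory on demand before returning the file path) comes from the description above.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

/** Hypothetical stand-in for the JN storage class; not the actual Hadoop code. */
public class PaxosDirSketch {
    // Assumed to correspond to "<disk>/jn/<journal id>/current".
    private final File currentDir;

    public PaxosDirSketch(File currentDir) {
        this.currentDir = currentDir;
    }

    /** Path of the "paxos" directory; it may not exist yet. */
    File getPaxosDir() {
        return new File(currentDir, "paxos");
    }

    /**
     * Returns the paxos file for a segment, creating the "paxos"
     * directory first if it is missing.
     */
    public File getPaxosFile(long segmentTxId) throws IOException {
        File paxosDir = getPaxosDir();
        // Does nothing if the directory already exists; throws an
        // IOException if it cannot be created or the path is a regular file.
        Files.createDirectories(paxosDir.toPath());
        return new File(paxosDir, String.valueOf(segmentTxId));
    }

    public static void main(String[] args) throws IOException {
        // Simulate a re-installed JN: "current" exists, "paxos" does not.
        File current = Files.createTempDirectory("jn-current").toFile();
        PaxosDirSketch storage = new PaxosDirSketch(current);

        File paxosFile = storage.getPaxosFile(64044L);
        File tmp = new File(paxosFile.getParentFile(), paxosFile.getName() + ".tmp");

        // Without the directory check, opening this stream fails with
        // FileNotFoundException (No such file or directory).
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write(1);
        }
        System.out.println("wrote " + tmp.getName());
    }
}
```

      Creating the directory inside the getter keeps the change local to one method, at the cost of a getter with a side effect. Creating the directory once at JN startup would be an alternative with the same effect on the crash.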

      Attachments

        1. HDFS-10659.000.patch
          52 kB
          Hanisha Koneru
        2. HDFS-10659.001.patch
          37 kB
          Hanisha Koneru
        3. HDFS-10659.002.patch
          36 kB
          Hanisha Koneru
        4. HDFS-10659.003.patch
          39 kB
          Hanisha Koneru
        5. HDFS-10659.004.patch
          6 kB
          Hanisha Koneru
        6. HDFS-10659.005.patch
          3 kB
          star
        7. HDFS-10659.006.patch
          4 kB
          star

            People

              starphin star
              aanand001c Amit Anand
              Votes: 0
              Watchers: 18