  Hadoop HDFS / HDFS-10659

Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 3.3.0, 3.2.1, 3.1.3
    • Component/s: ha, journal-node
    • Labels:
      None

      Description

      In my environment, Namenodes crash after a majority of the Journalnodes are re-installed. We manage multiple clusters and do rolling upgrades followed by a rolling re-install of each node, including the master (NN, JN, RM, ZK) nodes. When a Journalnode is re-installed or moved to a new disk/host, instead of running the "initializeSharedEdits" command, I copy the VERSION file from one of the other Journalnodes, which allows the NN to start writing data to the newly installed Journalnode.

      To achieve JN quorum and recover unfinalized segments, the NN during startup creates NNNN.tmp files under the "<disk>/jn/current/paxos" directory. In the current implementation, the "paxos" directory is created only by the "initializeSharedEdits" command. If a JN is re-installed, the "paxos" directory is not created on JN startup, nor by the NN when it writes NNNN.tmp files, which causes the NN to crash with the following error message:

      192.168.100.16:8485: /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No such file or directory)
              at java.io.FileOutputStream.open(Native Method)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
              at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
              at org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
              at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
              at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
              at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
              at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
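
The failure in the trace above can be reproduced outside Hadoop. The following is a minimal sketch, not Hadoop code: it assumes only that the temporary file is opened with a plain java.io.FileOutputStream (as the trace suggests), and the class and method names (MissingPaxosDirDemo, openFailsWithoutParent) are hypothetical.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of Hadoop.
class MissingPaxosDirDemo {

    /**
     * Tries to open <currentDir>/paxos/<txid>.tmp for writing.
     * Returns true if the open fails with FileNotFoundException,
     * which is what happens when the "paxos" directory is missing.
     */
    static boolean openFailsWithoutParent(File currentDir, long txid) {
        File tmp = new File(new File(currentDir, "paxos"), txid + ".tmp");
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            // The open succeeded, so the parent directory exists.
            return false;
        } catch (FileNotFoundException e) {
            // FileOutputStream does not create missing parent directories.
            return true;
        } catch (IOException e) {
            // Only reachable if closing the stream fails.
            throw new UncheckedIOException(e);
        }
    }
}
```

FileOutputStream never creates parent directories, so any code path that writes under "paxos" depends on something else having created that directory first.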
      

      The current getPaxosFile method simply returns a path to a file under the "paxos" directory without verifying that the directory exists. Since the "paxos" directory holds files that are required for NN recovery and for achieving JN quorum, my proposed solution is to add a check to the "getPaxosFile" method that creates the "paxos" directory if it is missing.
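
The proposed check could look roughly like the sketch below. This is an illustration only, not the committed patch: the class name PaxosDirSketch, the currentDir parameter, and the choice to name the file after the segment transaction ID are assumptions made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical stand-in for the JN storage class; not Hadoop code.
class PaxosDirSketch {

    /**
     * Returns the path <currentDir>/paxos/<segmentTxId>, creating the
     * "paxos" directory first if it does not exist yet.
     */
    static File getPaxosFile(File currentDir, long segmentTxId) {
        File paxosDir = new File(currentDir, "paxos");
        try {
            // No-op when the directory already exists; fails if the
            // path exists but is not a directory.
            Files.createDirectories(paxosDir.toPath());
        } catch (IOException e) {
            throw new UncheckedIOException("Could not create " + paxosDir, e);
        }
        return new File(paxosDir, String.valueOf(segmentTxId));
    }
}
```

With a check like this, a JN whose storage was seeded only by copying the VERSION file would recreate "paxos" the first time a recovery file is requested, rather than relying on "initializeSharedEdits" having been run.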

        Attachments

        1. HDFS-10659.006.patch
          4 kB
          star
        2. HDFS-10659.005.patch
          3 kB
          star
        3. HDFS-10659.004.patch
          6 kB
          Hanisha Koneru
        4. HDFS-10659.003.patch
          39 kB
          Hanisha Koneru
        5. HDFS-10659.002.patch
          36 kB
          Hanisha Koneru
        6. HDFS-10659.001.patch
          37 kB
          Hanisha Koneru
        7. HDFS-10659.000.patch
          52 kB
          Hanisha Koneru

              People

              • Assignee:
                starphin star
                Reporter:
                aanand001c Amit Anand
              • Votes:
                0
                Watchers:
                17
