Details
Description
In my environment I am seeing Namenodes crashing down after majority of Journalnodes are re-installed. We manage multiple clusters and do rolling upgrades followed by rolling re-install of each node including master(NN, JN, RM, ZK) nodes. When a journal node is re-installed or moved to a new disk/host, instead of running "initializeSharedEdits" command, I copy VERSION file from one of the other Journalnode and that allows my NN to start writing data to the newly installed Journalnode.
To acheive quorum for JN and recover unfinalized segments NN during starupt creates NNNN.tmp files under "<disk>/jn/current/paxos" directory . In current implementation "paxos" directry is only created during "initializeSharedEdits" command and if a JN is re-installed the "paxos" directory is not created upon JN startup or by NN while writing NNNN.tmp files which causes NN to crash with following error message:
192.168.100.16:8485: /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58) at org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971) at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846) at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205) at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249) at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
The current getPaxosFile method simply returns a path to a file under "paxos" directory without verifiying its existence. Since "paxos" directoy holds files that are required for NN recovery and acheiving JN quorum my proposed solution is to add a check to "getPaxosFile" method and create the "paxos" directory if it is missing.
Attachments
Attachments
Issue Links
- is blocked by
-
HDFS-4025 QJM: Sychronize past log segments to JNs that missed them
- Resolved
- is duplicated by
-
HDFS-14424 NN failover failed beacuse of lossing paxos directory
- Resolved