Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: QuorumJournalManager (HDFS-3077)
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Currently, the JournalNodes automatically format themselves when a new writer takes over, if they don't have any data for that namespace. However, this has a few problems:
      1) if the administrator accidentally points a new NN at the wrong quorum (e.g. one corresponding to another cluster), it will auto-format a directory on those nodes. This doesn't cause any data loss, but it would be better to bail out with an error indicating that they need to be formatted.
      2) if a journal node crashes and needs to be reformatted, it should be able to re-join the cluster and start storing new segments without having to fail over to a new NN.
      3) if 2 of the 3 JNs get accidentally reformatted (e.g. the mount point comes undone) and the user starts the NN, it should fail to start, because it may end up missing edits. If it auto-formats in this case, the user might get a silent "rollback" of the most recent edits.

        Activity

        Todd Lipcon added a comment -

        I'll propose the following design:

        • the "hdfs namenode -format" command should take confirmation and then format all of the underlying JNs, regardless of whether they currently have data.
        • at startup, and at the beginning of each log segment, if a quorum of JNs are formatted, but a minority are not, the NN should auto-format the minority. This allows an admin to replace a dead JN without taking any downtime or running any "format" command. He or she simply re-starts the dead node with a fresh disk, or reassigns a VIP/CNAME to some new node.
        • at startup, if the majority of JNs are unformatted, the NN should refuse to start up, because it may result in missing edits. This would require manual intervention, for now, if the admin really wants to start up despite the potential data loss (eg rsyncing one JN's directories to one of the fresh nodes). A future enhancement would be to automate this "unsafe startup" process.
        • for the HA use case, the "initializeSharedEdits" function would take care of formatting the JNs.

        The above proposed behavior is based on what we currently do with storage directories: if one is formatted and another is not, we will auto-format the empty one. If none are formatted, we require an explicit format step.
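        As a rough, self-contained sketch of the decision this implies (illustrative names only, not actual NameNode code):

        import java.util.Map;

        /** Illustrative model of the proposed behavior; not existing HDFS code. */
        enum JournalStartupAction { PROCEED, AUTO_FORMAT_MINORITY, REFUSE_TO_START, REQUIRE_EXPLICIT_FORMAT }

        final class JournalFormatPolicy {
          /** Decide what to do given which JNs report themselves as formatted. */
          static JournalStartupAction decide(Map<String, Boolean> jnFormatted) {
            long formatted = jnFormatted.values().stream().filter(Boolean::booleanValue).count();
            int total = jnFormatted.size();
            if (formatted == 0) {
              return JournalStartupAction.REQUIRE_EXPLICIT_FORMAT; // needs an explicit "hdfs namenode -format"
            }
            if (formatted * 2 <= total) {
              return JournalStartupAction.REFUSE_TO_START;         // majority unformatted: edits may be missing
            }
            if (formatted < total) {
              return JournalStartupAction.AUTO_FORMAT_MINORITY;    // a replaced JN rejoins with no downtime
            }
            return JournalStartupAction.PROCEED;
          }
        }

        In a 3-JN setup, for example, one unformatted JN would be auto-formatted, while two unformatted JNs would block startup.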

        Aaron T. Myers added a comment -

        +1, this design makes sense to me.

        Andrew Purtell added a comment -

        I'm not sure about the notion of automating an "unsafe startup" when the majority of JNs are unformatted. What if, instead, it were possible to start the NN in recovery mode and have it interactively suggest actions, including initializing the unformatted JNs? It could summarize the most recent txn (or a few txns) of the available logs before asking which txid to choose as the latest.

        Jian Fang added a comment -

        Any progress on this JIRA? JN auto-formatting is critical for running Hadoop as a service in a cloud environment, because there is no system admin to work on the replacement node and everything is automated. If a patch is not available now, what would be an automated workaround for this issue?

        Todd Lipcon added a comment -

        No progress from my side. This didn't turn out to be a highly requested feature among our customer base, so I haven't worked on it, and lately I'm focusing on some other projects.

        Jian Fang added a comment -

        Unfortunately, this feature is critical for Hadoop providers in the cloud. HA does not really work without it, because a node in the cloud, for example an EC2 instance, could be gone at any time and a replacement will be launched. The new JN on the replacement instance must be configured automatically. Could you please provide some guidance on how to implement this feature? Thanks a lot.

        Todd Lipcon added a comment -

        See my comment above on a proposed design. I'd probably do something like that.

        Jian Fang added a comment -

        I was working on other things and have now come back to this JIRA.

        In my use case, I care more about a replacement JN after the EC2 instance where a JN was running has gone away. I looked at the format() API; it seems the required information to format a JN is NamespaceInfo, but such information cannot be obtained from a running name node by running a separate command line, because the directory is locked by the name node. Also, the list of IPCLoggerChannels in QJM needs to be updated if we don't restart the name node. This makes me think of using the HADOOP-7001 support to have QJM call the format() API when it becomes aware that new JNs have been introduced in the Hadoop configuration. The running QJM has the NamespaceInfo object in memory, and it could update the list of IPCLoggerChannels as well once the new JNs are formatted successfully.
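        A rough standalone sketch of this idea (all names below are made up for illustration; they are not existing Hadoop classes):

        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;

        /** Stand-in for an IPCLoggerChannel; a made-up interface for illustration only. */
        interface JournalChannel {
          String address();
          boolean isFormatted();
          void format(Object nsInfo); // NamespaceInfo kept opaque in this sketch
        }

        final class JournalReconfigureSketch {
          /** On a HADOOP-7001-style config change, format only the newly added JNs. */
          static void onJournalNodesChanged(List<JournalChannel> current,
                                            List<JournalChannel> updated,
                                            Object nsInfo) {
            Set<String> known = new HashSet<>();
            for (JournalChannel c : current) {
              known.add(c.address());
            }
            for (JournalChannel c : updated) {
              // Only channels that are new in the configuration and report themselves
              // unformatted get primed; existing JNs keep their data untouched.
              if (!known.contains(c.address()) && !c.isFormatted()) {
                c.format(nsInfo);
              }
            }
            // The QJM would then swap its logger list to "updated".
          }
        }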

        Does this idea make sense at all?

        Thanks.

        Jian Fang added a comment -

        I meant that we cannot run the "initializeSharedEdits" command to format a new replacement JN (or any JNs at all) while the name node is running, because the directory is locked and we see the following exception:

        ERROR namenode.NameNode: Could not initialize shared edits dir
        java.io.IOException: Cannot lock storage /var/lib/hadoop/dfs-name. The directory is already locked.

        As a result, it should be the QJM's responsibility to detect configuration changes at run time via HADOOP-7001 and format the new JNs properly. If this really works, perhaps a rolling restart of the JNs is no longer needed, since they don't need to communicate with each other to make decisions the way ZooKeeper instances do. If I understand correctly, the Quorum Journal protocol only implements the log replication part of Paxos, right?

        Jian Fang added a comment -

        I mean a rolling restart for configuration changes, not for upgrades; for the latter, we perhaps still need a rolling restart.

        Jian Fang added a comment -

        I didn't get a chance to work on this yet, and am coming back to this issue again.

        Since HADOOP-7001 is a long way off, I would start by fixing a specific case first, i.e., making QJM able to format a new journal node after a journal node has been replaced.

        My thought is to add some logic to the beginning of the following method in QuorumJournalManager:

        Map<AsyncLogger, NewEpochResponseProto> createNewUniqueEpoch() throws IOException

        to check all available journal nodes by calling the following method.

        QuorumCall<AsyncLogger, Boolean> call = loggers.isFormatted();

        The call will wait for all journal nodes to respond, and time out after a given period to avoid waiting forever. If the call times out, we simply ignore it and continue the normal workflow in createNewUniqueEpoch(). If the call succeeds, we check whether any journal node is not formatted; for each unformatted one, we call format(nsInfo) on that logger. The nsInfo is already available to the QJM, so I think it should be able to format the new journal node successfully. A rough sketch is below.
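        Roughly what I have in mind at the top of createNewUniqueEpoch() (a sketch only; formatCheckTimeoutMs is a made-up constant, and the exact QuorumCall/AsyncLogger semantics and signatures may differ between Hadoop versions):

        try {
          QuorumCall<AsyncLogger, Boolean> call = loggers.isFormatted();
          // Parameters here are illustrative: wait for all JNs to answer, bounded by a
          // timeout so a dead JN cannot block the new writer forever.
          call.waitFor(loggers.size(), loggers.size(), 0, formatCheckTimeoutMs, "isFormatted");
          for (Map.Entry<AsyncLogger, Boolean> entry : call.getResults().entrySet()) {
            if (!entry.getValue()) {
              // nsInfo is already held by the QJM, so the unformatted JN can be primed here.
              entry.getKey().format(nsInfo).get();
            }
          }
        } catch (TimeoutException te) {
          // Not all JNs responded in time: skip the auto-format attempt and continue
          // with the normal createNewUniqueEpoch() workflow.
        } catch (InterruptedException | ExecutionException e) {
          throw new IOException("Failed while checking or formatting JournalNodes", e);
        }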

        But I have a couple of questions:

        1) Will this extra step, with its wait time, cause any trouble for the new active QJM?
        2) Would this extra step introduce a lot of overhead under normal conditions, when no journal node needs to be formatted?
        3) Since in our case we need to restart the name nodes after a new journal node is in place, createNewUniqueEpoch() should be called once to format the new journal node. Is this assumption valid?
        4) Once a new journal node is formatted, are there any extra steps to make it sync data from its peers, or is this already handled by the quorum protocol?

        Thanks.

        Jian Fang added a comment -

        I added a formatNonFileJournalsIfNecessary() method to FSEditLog to check for unformatted shared edits and format them. It worked sometimes, but failed at other times because this method was only called in startActiveServices(). If the new journal node did not come up quickly enough, the new active name node could fail to format the unformatted journal, because it cannot wait forever for the RPC responses in QuorumCall when some existing journal nodes have gone down.

        If there were a convenient way to store the previously configured/formatted journal nodes, so that we could derive which journal nodes are new, things would become much easier: we could wait longer for the new journal nodes and would not need to do anything further once all of them are formatted successfully. Unfortunately, this is not easy without bringing in extra dependencies, which is not good.

        Another alternative is to modify the behavior of "initializeSharedEdits", for example by adding a "-newEditsOnly" option that formats only new journal nodes and leaves existing journal nodes intact. This requires stopping the name node first, running "hdfs namenode -initializeSharedEdits -newEditsOnly" to format the new journal nodes, and then starting the name node back up. The disadvantage is that we may need to solve this issue again in a new way if we later want reconfiguration without restart. A rough sketch is below.
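        A rough sketch of the "-newEditsOnly" branch inside initializeSharedEdits() (the flag and this control flow are hypothetical; formatNonFileJournals(nsInfo) is the existing FSEditLog call):

        // Hypothetical sketch: the newEditsOnly flag and this branch are made up.
        if (newEditsOnly) {
          // Prime only the shared (QJM) journals with the namespace info; skip locking the
          // local storage dirs and copying old segments, since the running NameNode holds
          // the lock and a fresh JN can catch up through QJM segment recovery. A real
          // implementation would also have to leave already-formatted JNs untouched.
          sharedEditsImage.getEditLog().formatNonFileJournals(nsInfo);
          // ...then return success here, skipping the close/initJournalsForWrite/
          // recoverUnclosedStreams/copyEditLogSegmentsToSharedDir steps below.
        }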

        Jian Fang added a comment -

        I implemented the "-newEditsOnly" option and tested it on a cluster with 3 instances that ran server-side daemons such as name nodes, resource managers, journal nodes, and others. I replaced two out of the three instances successfully, with the name node running fine, by issuing the command "hdfs namenode -initializeSharedEdits -newEditsOnly" before starting a name node. However, I ran into an issue when I replaced the last remaining original instance, i.e., when replacing the original m1, m2, and m3 with m4, m5, and m6 one at a time: the name node got stuck loading the fsimage.

        The error message was as follows.

        "
        2016-04-04 06:51:34,612 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode (main): Loading 1 INodes.
        2016-04-04 06:51:34,670 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf (main): Loaded FSImage in 0 seconds.
        2016-04-04 06:51:34,670 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Loaded image for txid 0 from /mnt/var/lib/hadoop/dfs-name/current/fsimage_0000000000000000000
        2016-04-04 06:51:34,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@3af6b48f expecting start txid #1
        2016-04-04 06:51:34,678 INFO org.apache.hadoop.hdfs.server.namenode.FSImage (main): Start loading edits file http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280
        2016-04-04 06:51:34,683 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream (main): Fast-forwarding stream 'http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280' to transaction ID 1
        2016-04-04 06:51:34,683 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream (main): Fast-forwarding stream 'http://ip-10-171-94-13.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=19&storageInfo=-63%3A1708928638%3A0%3ACID-218c0d5a-8772-4a75-8ede-44f84d24f280' to transaction ID 1
        2016-04-04 06:51:34,845 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem (main): Encountered exception loading fsimage
        java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 19.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:837)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:692)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:814)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:798)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1491)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1557)
        "

        It seems the fsimage was messed up during the replacement. Could someone shed some light on what could possibly be wrong?

        In my "-newEditsOnly" logic, I skipped the following code snippet in the initializeSharedEdits() method because the other name node could be still running and be an active one when the above command was issued. I assumed that the new journal node could rely on the recovery protocol in QJM to sync up with other peers for the latest transactions.

        ----------------------------------------------------------------------------------------------
        NNStorage newSharedStorage = sharedEditsImage.getStorage();

        // Call Storage.format instead of FSImage.format here, since we don't
        // actually want to save a checkpoint - just prime the dirs with
        // the existing namespace info
        newSharedStorage.format(nsInfo);
        sharedEditsImage.getEditLog().formatNonFileJournals(nsInfo);

        // Need to make sure the edit log segments are in good shape to initialize
        // the shared edits dir.
        fsns.getFSImage().getEditLog().close();
        fsns.getFSImage().getEditLog().initJournalsForWrite();
        fsns.getFSImage().getEditLog().recoverUnclosedStreams();

        copyEditLogSegmentsToSharedDir(fsns, sharedEditsDirs, newSharedStorage, conf);
        --------------------------------------------------------------------------------------------
        On the newly formatted journal node, I did observe that the segments were recovered from its peers successfully, for example:

        2016-04-11 19:06:38,894 WARN org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 0 on 8485): Storage directory /mnt/var/lib/hadoop/journal/local/data/my-cluster does not exist
        2016-04-11 19:06:39,170 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 1 on 8485): Formatting org.apache.hadoop.hdfs.qjournal.server.Journal@21e17c6d with namespace info: lv=-63;cid=CID-72db83e1-fe0c-418a-a861-d9525f73adc8;nsid=1427415229;c=0;bpid=BP-1127919701-10.236.146.62-1460399925301
        2016-04-11 19:06:39,170 INFO org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 1 on 8485): Formatting journal Storage Directory /mnt/var/lib/hadoop/journal/local/data/my-cluster with nsid: 1427415229
        2016-04-11 19:06:39,216 INFO org.apache.hadoop.hdfs.server.common.Storage (IPC Server handler 1 on 8485): Lock on /mnt/var/lib/hadoop/journal/local/data/my-cluster/in_use.lock acquired by nodename 2774@ip-10-225-14-41.ec2.internal
        2016-04-11 19:06:49,227 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): Updating lastPromisedEpoch from 0 to 3 for client /10.139.37.248
        2016-04-11 19:06:49,274 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): Scanning storage FileJournalManager(root=/mnt/var/lib/hadoop/journal/local/data/my-cluster)
        2016-04-11 19:06:49,294 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 3 on 8485): No files in FileJournalManager(root=/mnt/var/lib/hadoop/journal/local/data/my-cluster)
        2016-04-11 19:06:49,356 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 4 on 8485): Prepared recovery for segment 1667: lastWriterEpoch: 0
        2016-04-11 19:06:49,545 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Synchronizing log startTxId: 1667 endTxId: 3104 isInProgress: true: no current segment in place
        2016-04-11 19:06:49,553 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Synchronizing log startTxId: 1667 endTxId: 3104 isInProgress: true from http://ip-10-139-37-248.ec2.internal:8480/getJournal?jid=my-cluster&segmentTxId=1667&storageInfo=-63%3A1427415229%3A0%3ACID-72db83e1-fe0c-418a-a861-d9525f73adc8
        2016-04-11 19:06:50,336 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage (IPC Server handler 0 on 8485): Image Transfer timeout configured to 60000 milliseconds
        2016-04-11 19:06:50,616 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage (IPC Server handler 0 on 8485): Transfer took 0.25s at 4079.68 KB/s
        2016-04-11 19:06:50,632 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 0 on 8485): Accepted recovery for segment 1667: segmentState { startTxId: 1667 endTxId: 3104 isInProgress: true } acceptedInEpoch: 3
        2016-04-11 19:06:50,646 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 2 on 8485): Validating log segment /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_inprogress_0000000000000001667 about to be finalized
        2016-04-11 19:06:50,807 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager (IPC Server handler 2 on 8485): Finalizing edits file /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_inprogress_0000000000000001667 -> /mnt/var/lib/hadoop/journal/local/data/my-cluster/current/edits_0000000000000001667-0000000000000003104
        2016-04-11 19:06:50,845 INFO org.apache.hadoop.hdfs.qjournal.server.Journal (IPC Server handler 4 on 8485): Updating lastWriterEpoch from 0 to 3 for client /10.139.37.248

        Did I miss anything that could mess up the fsimage files?

        Thanks in advance.

        Jian Fang added a comment -

        The above issue was more likely caused by some race condition in restarting the name nodes and journal nodes than by my code changes. I will create a separate JIRA to add the "-newEditsOnly" option to initializeSharedEdits and link it here later.


          People

          • Assignee: Unassigned
          • Reporter: Todd Lipcon
          • Votes: 1
          • Watchers: 17
