Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: QuorumJournalManager (HDFS-3077)
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Currently, the JournalNodes automatically format themselves when a new writer takes over, if they don't have any data for that namespace. However, this has a few problems:
      1) if the administrator accidentally points a new NN at the wrong quorum (e.g., one corresponding to another cluster), it will auto-format a directory on those nodes. This doesn't cause any data loss, but it would be better to bail out with an error indicating that the JNs need to be formatted.
      2) if a journal node crashes and needs to be reformatted, it should be able to re-join the cluster and start storing new segments without requiring a failover to a new NN.
      3) if 2/3 of the JNs get accidentally reformatted (e.g., a mount point comes undone) and the user starts the NN, it should fail to start, because it may be missing edits. If it auto-formats in this case, the user might see a silent "rollback" of the most recent edits.

        Activity

        Todd Lipcon added a comment -

        I'll propose the following design:

        • the "hdfs namenode -format" command should take confirmation and then format all of the underlying JNs, regardless of whether they currently have data.
        • at startup, and at the beginning of each log segment, if a quorum of JNs are formatted, but a minority are not, the NN should auto-format the minority. This allows an admin to replace a dead JN without taking any downtime or running any "format" command. He or she simply re-starts the dead node with a fresh disk, or reassigns a VIP/CNAME to some new node.
        • at startup, if the majority of JNs are unformatted, the NN should refuse to start up, because it may result in missing edits. This would require manual intervention, for now, if the admin really wants to start up despite the potential data loss (eg rsyncing one JN's directories to one of the fresh nodes). A future enhancement would be to automate this "unsafe startup" process.
        • for the HA use case, the "initializeSharedEdits" function would take care of formatting the JNs.

        The above proposed behavior is based on what we currently do with storage directories: if one is formatted and another is not, we will auto-format the empty one. If none are formatted, we require an explicit format step.
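        The quorum decision described above can be sketched as follows. This is a hypothetical helper for illustration only, not the actual QJM code; the class and method names are made up:

        ```java
        import java.util.List;

        public class FormatDecision {
            public enum Action { PROCEED, AUTO_FORMAT_MINORITY, REFUSE_STARTUP }

            /**
             * Decide what the NN should do given which JNs report themselves
             * as formatted, mirroring the proposed rules:
             *  - all formatted: proceed normally
             *  - a quorum formatted, a minority not: auto-format the minority
             *  - a majority unformatted: refuse to start (possible missing edits)
             */
            public static Action decide(List<Boolean> jnFormatted) {
                int total = jnFormatted.size();
                int formatted = 0;
                for (boolean f : jnFormatted) {
                    if (f) formatted++;
                }
                int quorum = total / 2 + 1;  // e.g., 2 out of 3 JNs
                if (formatted == total) {
                    return Action.PROCEED;
                } else if (formatted >= quorum) {
                    return Action.AUTO_FORMAT_MINORITY;
                } else {
                    return Action.REFUSE_STARTUP;
                }
            }
        }
        ```

        With 3 JNs this yields: all formatted → proceed; 2 formatted → auto-format the third; 1 or 0 formatted → refuse startup and require manual intervention.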

        Aaron T. Myers added a comment -

        +1, this design makes sense to me.

        Andrew Purtell added a comment -

        Not sure about the notion of automating an "unsafe startup" in the case the majority of JNs are unformatted. What if instead, it's possible to start up the NN in recovery mode and have it interactively suggest actions including initializing the unformatted JNs? Could summarize the most recent txn (or a few txns) of the available logs before asking which txid to choose as latest?

        Jian Fang added a comment -

        Any progress on this JIRA? JN auto-format is critical for running Hadoop as a service in a cloud environment, because there is no system administrator to work on the replacement node and everything is automated. If a patch is not available now, what would be an automated workaround for this issue?

        Todd Lipcon added a comment -

        No progress from my side. This didn't turn out to be a highly requested feature in our customer base, so haven't worked on it, and lately I'm focusing on some other projects.

        Jian Fang added a comment -

        Unfortunately, this feature is critical for Hadoop providers in the cloud. HA does not really work without it, because a cloud node (for example, an EC2 instance) can disappear at any time and a replacement will be launched. The new JN on the replacement instance must be configured automatically. Could you please give us some guidance on how to implement this feature? Thanks a lot.

        Todd Lipcon added a comment -

        See my comment above on a proposed design. I'd probably do something like that.

        Jian Fang added a comment -

        I was working on other things and have now come back to this JIRA.

        In my use case, I care mostly about a replacement JN after the EC2 instance where a JN was running is gone. I looked at the format() API; it seems the information required to format a JN is a NamespaceInfo. However, such information cannot be obtained from a running name node via a separate command line, because the directory is locked by the name node. Also, the list of IPCLoggerChannels in the QJM needs to be updated if we don't restart the name node. This makes me think of using the HADOOP-7001 support in the QJM to call the format() API when it becomes aware that new JNs have been introduced in the Hadoop configuration. The running QJM has the NamespaceInfo object in memory, and it could update the list of IPCLoggerChannels as well once the new JNs are formatted successfully.

        Does this idea make sense at all?

        Thanks.
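        A rough sketch of that idea, with all names illustrative (the real IPCLoggerChannel and NamespaceInfo types are stood in for by strings, and the reconfiguration hook is hypothetical):

        ```java
        import java.util.ArrayList;
        import java.util.LinkedHashMap;
        import java.util.List;
        import java.util.Map;

        public class QjmReconfigSketch {
            // per-JN formatted state, keyed by address; stand-in for the
            // QJM's list of IPCLoggerChannels
            final Map<String, Boolean> jns = new LinkedHashMap<>();
            private final String nsInfo;  // the NamespaceInfo a running QJM already holds

            public QjmReconfigSketch(String nsInfo, List<String> initialJns) {
                this.nsInfo = nsInfo;
                for (String addr : initialJns) {
                    jns.put(addr, true);  // pre-existing JNs are already formatted
                }
            }

            /** Hypothetical hook fired when the configured JN list changes
             *  (HADOOP-7001-style runtime reconfiguration). */
            public List<String> onJournalNodesReconfigured(List<String> newJnList) {
                List<String> newlyFormatted = new ArrayList<>();
                for (String addr : newJnList) {
                    if (!jns.containsKey(addr)) {
                        // a replacement JN: format it using the in-memory
                        // NamespaceInfo (stand-in for an IPC format(nsInfo) call)
                        jns.put(addr, true);
                        newlyFormatted.add(addr);
                    }
                }
                jns.keySet().retainAll(newJnList);  // drop channels for removed JNs
                return newlyFormatted;
            }
        }
        ```

        The point of the sketch is only the flow: on a configuration change, the running QJM formats any unknown JN with the NamespaceInfo it already has and refreshes its channel list in place, with no name-node restart.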

        Jian Fang added a comment -

        I meant that we cannot run the "initializeSharedEdits" command to format a new replacement JN (or any JNs at all) while the name node is running, because the directory is locked and we saw the following exception:

        ERROR namenode.NameNode: Could not initialize shared edits dir
        java.io.IOException: Cannot lock storage /var/lib/hadoop/dfs-name. The directory is already locked.

        As a result, it should be the QJM's responsibility to detect configuration changes at run time via HADOOP-7001 and format the new JNs properly. If this really works, perhaps you no longer need a rolling restart of the JNs, since they don't need to communicate with each other to make decisions the way ZooKeeper instances do. If I understand correctly, the Quorum Journal protocol only implements the log-replication part of Paxos, right?

        Jian Fang added a comment -

        I mean a rolling restart for configuration changes, not for upgrades; for the latter, we probably still need a rolling restart.


          People

          • Assignee:
            Unassigned
            Reporter:
            Todd Lipcon
          • Votes:
            0
            Watchers:
            13

            Dates

            • Created:
              Updated:
