HDFS-2794 (Hadoop HDFS; sub-task of HDFS-1623, High Availability Framework for HDFS NN)

HA: Active NN may purge edit log files before standby NN has a chance to read them

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: HA branch (HDFS-1623)
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: ha, namenode
    • Labels:
      None

      Description

      Given that the active NN is solely responsible for purging finalized edit log segments, and given that the active NN has no way of knowing when the standby reads edit logs, it's possible that the standby NN could fail to read all edits it needs before the active purges the files.

      1. hdfs-2794.txt
        9 kB
        Todd Lipcon
      2. hdfs-2794.txt
        9 kB
        Todd Lipcon

        Activity

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-HAbranch-build #71 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/71/)
        HDFS-2794. Active NN may purge edit log files before standby NN has a chance to read them. Contributed by Todd Lipcon.

        todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1241317
        Files :

        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorageRetentionManager.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNNStorageRetentionFunctional.java
        • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNNStorageRetentionManager.java
        Todd Lipcon made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Fix Version/s HA branch (HDFS-1623) [ 12317568 ]
        Resolution Fixed [ 1 ]
        Todd Lipcon added a comment -

        Thanks for the reviews, committed to the HA branch

        Hari Mankude added a comment -

        Decommissioning an SBN in HA mode might not be as simple as powering it off. For example, if the user wanted to revert to non-HA, there should be a provision for that.

        But, ok with the changes. No issues from my side, +1

        Eli Collins added a comment -

        Ignore my previous comment, was on a stale page. Latest patch looks good, +1

        Eli Collins added a comment -

        Todd, new comment looks good. Per last comment the patch needs to be updated to fix the missing xml tags.

        Todd Lipcon added a comment -

        I intend to commit this later this afternoon unless there are objections - let me know if the above explanation doesn't make sense.

        Todd Lipcon made changes -
        Attachment hdfs-2794.txt [ 12513510 ]
        Todd Lipcon added a comment -

        Oops! Fixed in new rev of patch. I ran xmllint on the xml file to be sure.

        Eli Collins added a comment -

        Nit: the new parameter in hdfs-default.xml is missing a close tag for description.

        Todd Lipcon added a comment -

        I thought a bit about that, but it would require another communication channel between the active and the standby, and it has implications for decommissioning standbys as well.

        For example, one solution I considered was to have the SBN write a file into the shared edits dir marking the latest txnid for which it had a checkpoint. The ANN could then use that to determine what point the edit logs could be purged to. However, this was problematic for several reasons:
        1) Decommissioning an SBN becomes more complicated than just turning it off – if you just turn it off, then the active will never again purge edit logs, which seems problematic.
        2) Dropping a file in the shared edits dir breaks the "journal" abstraction - we'd need to implement a different back-channel for BK-based logging, for example.
        3) Extra code complexity, especially if in the future we want to support multiple SBNs.
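
To make problem 1 concrete, here is a minimal sketch of the rejected marker-file idea. The class and method names are hypothetical and this is not code from the patch; it only assumes that the active would purge no further than the txid recorded in the standby's marker. Once the standby is switched off, the marker stops advancing and so does the purge point:

```java
// Hypothetical model of the rejected "standby writes a marker file" design.
// None of these names exist in HDFS; they are for illustration only.
final class StandbyMarkerSketch {

    /**
     * Returns the first txid that must be kept. Edits below it could be purged.
     * activeRequiredTxId: first edit the active itself still needs.
     * standbyCheckpointTxId: last txid covered by the standby's latest checkpoint,
     * as read from the (assumed) marker file.
     */
    static long firstTxIdToKeep(long activeRequiredTxId, long standbyCheckpointTxId) {
        long standbyRequiredTxId = standbyCheckpointTxId + 1;
        // The purge point can never move past what the standby still needs.
        return Math.min(activeRequiredTxId, standbyRequiredTxId);
    }

    public static void main(String[] args) {
        // The standby was powered off after checkpointing at txid 1000,
        // so its marker is stuck there while the active keeps advancing.
        long staleMarker = 1_000L;
        for (long activeRequired : new long[] {2_000L, 50_000L, 900_000L}) {
            System.out.println("active needs edits from " + activeRequired
                + ", purge point stays at " + firstTxIdToKeep(activeRequired, staleMarker));
        }
    }
}
```

In this model the purge point stays at 1001 however far the active advances, which is the "never again purge edit logs" behavior described in point 1.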

        I also considered the operator perspective of consistency with other similar systems. In configuring MySQL replication, for example, the operator configures a "binary log retention period" as a number of days for which to retain older binlogs. If the slave is down for longer than this period, then it has to be re-bootstrapped with an rsync from the master.

        Given that we intend to later implement automatic bootstrapping when an SBN is started with a too-old image (HDFS-2731), that seems like a much simpler solution to the problem.
        The other advantage of the method implemented here is that other systems which want to consume edit logs probably will want higher retention as well, without the complexity of implementing a back-channel "purge" command to the NN.

        Hari Mankude added a comment -

        Would it be possible to have standby purge edit logs on the active based on its requirements and its state? Otherwise, any solution is going to be a guess.

        Todd Lipcon added a comment -

        Changed the description to:

        <description>The number of extra transactions which should be retained
        beyond what is minimally necessary for a NN restart. This can be useful for
        audit purposes or for an HA setup where a remote Standby Node may have
        been offline for some time and need to have a longer backlog of retained
        edits in order to start again.
        Typically each edit is on the order of a few hundred bytes, so the default
        of 1 million edits should be on the order of hundreds of MBs or low GBs.

        Sound good?
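
As a rough check of the size estimate in the description above, the arithmetic can be sketched as follows. The per-edit sizes used here (200, 500 and 2000 bytes) are assumptions chosen to span "a few hundred bytes" and a pessimistic case; they are not measurements, and the class name is hypothetical:

```java
// Back-of-the-envelope estimate of the disk space used by retained edits.
// The per-edit byte counts are illustrative assumptions, not measured values.
final class EditBacklogSizeSketch {

    /** Total bytes for numEdits edits of avgBytesPerEdit each (throws on overflow). */
    static long estimateBytes(long numEdits, long avgBytesPerEdit) {
        return Math.multiplyExact(numEdits, avgBytesPerEdit);
    }

    public static void main(String[] args) {
        long edits = 1_000_000L; // the default discussed in this issue
        for (long bytesPerEdit : new long[] {200L, 500L, 2_000L}) {
            long bytes = estimateBytes(edits, bytesPerEdit);
            System.out.printf("%d bytes/edit -> about %d MB%n", bytesPerEdit, bytes / 1_000_000L);
        }
    }
}
```

Under these assumptions a backlog of one million edits lands between about 200 MB and 2 GB, consistent with "hundreds of MBs or low GBs".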

        Eli Collins added a comment -

        I like this approach better too. Patch looks good. +1

        The comment in hdfs-default is a little unclear. It's worth mentioning that the 1M edits are retained in addition to the edits we already keep for the current checkpoints.

        Todd Lipcon added a comment -

        Attached patch adds a new configuration, dfs.namenode.num.extra.edits.retained, which causes the NN to not purge a given number of edits that are older than the oldest retained local checkpoint. This seemed preferable to the other option discussed, which was to configure the NN to retain many more images. The reason is that even a million edits (the default) would be on the order of a few hundred MB, whereas retaining a day's worth of checkpoints might be on the order of hundreds of GB for a large cluster making frequent checkpoints.

        Retaining edits for a long period of time has some other useful applications, as well (eg a binary form of audit log).
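
The effect of dfs.namenode.num.extra.edits.retained on the purge point can be sketched as below. This is a simplified illustration under the assumption that the purge threshold is derived only from the oldest retained checkpoint's txid; the class and method names are hypothetical and are not taken from the attached patch:

```java
// Simplified model of how an "extra edits retained" setting moves the purge point.
// Names are hypothetical; the real logic lives in the NN's retention code.
final class EditRetentionSketch {

    /**
     * Returns the first txid whose edits are kept. Anything older may be purged.
     * oldestRetainedCheckpointTxId: last txid covered by the oldest checkpoint still kept.
     * numExtraEditsRetained: value of dfs.namenode.num.extra.edits.retained.
     */
    static long purgeLogsFrom(long oldestRetainedCheckpointTxId, long numExtraEditsRetained) {
        // Without the setting, only edits after the oldest retained checkpoint are needed.
        long minimumRequiredTxId = oldestRetainedCheckpointTxId + 1;
        // With it, keep that many additional older edits, clamped at zero.
        return Math.max(0L, minimumRequiredTxId - numExtraEditsRetained);
    }

    public static void main(String[] args) {
        System.out.println(purgeLogsFrom(5_000_000L, 0L));         // nothing extra kept
        System.out.println(purgeLogsFrom(5_000_000L, 1_000_000L)); // one million extra edits kept
        System.out.println(purgeLogsFrom(500L, 1_000_000L));       // young namespace: keep everything
    }
}
```

In this model a standby that has fallen up to one million transactions behind the oldest retained checkpoint can still find the edits it needs.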

        Todd Lipcon made changes -
        Attachment hdfs-2794.txt [ 12513399 ]
        Todd Lipcon made changes -
        Field Original Value New Value
        Assignee Aaron T. Myers [ atm ] Todd Lipcon [ tlipcon ]
        Aaron T. Myers added a comment -

        Agree it's not super high priority, but we should try to improve the situation if we can. Easy things which would help would be to:

        • Automatically configure a higher minimum retention count if HA is enabled
        • Make sure to document that admins should also configure a remote fsimage dir in addition to the shared remote edits dir
        Todd Lipcon added a comment -

        Worth noting that this only happens if the admin explicitly invokes saveNamespace on the active node more times than the configured retention count (or restarts it several times without running the SBN in between). So it's easy to work around by configuring a high retention count, and if you do hit the problem, you can simply scp any image from the active and restart the SBN.
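
The scenario can be illustrated with a small model. It assumes, as a simplification, that the active keeps only the newest N checkpoints and purges every edit at or below the oldest one it keeps; the class, method and parameter names are hypothetical:

```java
import java.util.List;

// Simplified model of repeated saveNamespace calls while the standby is down.
// Not the NN's actual retention code; names are hypothetical.
final class SaveNamespaceScenarioSketch {

    /**
     * checkpointTxIds: txids of all checkpoints taken so far, ascending and non-empty.
     * numCheckpointsRetained: how many of the newest checkpoints the active keeps.
     * standbyLastTxId: last txid the standby applied before it stopped.
     * Returns true if the next edit the standby needs has not been purged.
     */
    static boolean standbyCanCatchUp(List<Long> checkpointTxIds,
                                     int numCheckpointsRetained,
                                     long standbyLastTxId) {
        int oldestKeptIndex = Math.max(0, checkpointTxIds.size() - numCheckpointsRetained);
        long oldestKeptCheckpointTxId = checkpointTxIds.get(oldestKeptIndex);
        // Edits at or below the oldest kept checkpoint are assumed to be purged.
        long firstRetainedEditTxId = oldestKeptCheckpointTxId + 1;
        return standbyLastTxId + 1 >= firstRetainedEditTxId;
    }

    public static void main(String[] args) {
        // Standby stopped at txid 200; the admin then ran saveNamespace three times.
        List<Long> checkpoints = List.of(200L, 300L, 400L, 500L);
        System.out.println("retain 2: " + standbyCanCatchUp(checkpoints, 2, 200L));
        System.out.println("retain 4: " + standbyCanCatchUp(checkpoints, 4, 200L));
    }
}
```

With a retention count of 2 the standby can no longer resume from the edit logs, while a count of 4 keeps the edits it needs, which matches the workaround of configuring the retention count high.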

        Aaron T. Myers created issue -

          People

          • Assignee:
            Todd Lipcon
            Reporter:
            Aaron T. Myers
          • Votes:
            0
            Watchers:
            7