Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: ha
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This jira tracks the changes required for configuring HA setup for namenodes.

      Attachments

      1. HDFS-2231.txt
        34 kB
        Suresh Srinivas
      2. HDFS-2231.txt
        45 kB
        Suresh Srinivas
      3. HDFS-2231.txt
        32 kB
        Suresh Srinivas
      4. HDFS-2231.txt
        45 kB
        Suresh Srinivas

        Activity

        Suresh Srinivas added a comment -

        Current namenode and related configuration:

        Configuration that does not change for HA

        BackupNode related:
        DFS_NAMENODE_BACKUP_ADDRESS_KEY, DFS_NAMENODE_BACKUP_HTTP_ADDRESS_KEY, DFS_NAMENODE_BACKUP_SERVICE_RPC_ADDRESS_KEY

        Secondary namenode related
        DFS_NAMENODE_SECONDARY_HTTP_ADDRESS_KEY, DFS_SECONDARY_NAMENODE_KEYTAB_FILE_KEY, DFS_SECONDARY_NAMENODE_USER_NAME_KEY, DFS_SECONDARY_NAMENODE_KRB_HTTPS_USER_NAME_KEY

        Checkpointer related
        DFS_NAMENODE_CHECKPOINT_PERIOD_KEY, DFS_NAMENODE_CHECKPOINT_SIZE_KEY, DFS_NAMENODE_CHECKPOINT_DIR_KEY, DFS_NAMENODE_CHECKPOINT_EDITS_DIR_KEY

        Common configuration for active and standby (Set 1)

        DFS_NAMENODE_SAFEMODE_EXTENSION_KEY, DFS_NAMENODE_SAFEMODE_THRESHOLD_PCT_KEY, DFS_NAMENODE_HOSTS_KEY, DFS_NAMENODE_HOSTS_EXCLUDE_KEY, DFS_NAMENODE_DECOMMISSION_INTERVAL_KEY, DFS_NAMENODE_DECOMMISSION_NODES_PER_INTERVAL_KEY, DFS_NAMENODE_HANDLER_COUNT_KEY, DFS_NAMENODE_SERVICE_HANDLER_COUNT_KEY, DFS_NAMENODE_PLUGINS_KEY, DFS_NAMENODE_STARTUP_KEY, DFS_NAMENODE_NAME_CACHE_THRESHOLD_KEY, DFS_NAMENODE_MAX_OBJECTS_KEY, DFS_NAMENODE_UPGRADE_PERMISSION_KEY, DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY, DFS_NAMENODE_ACCESSTIME_PRECISION_KEY, DFS_NAMENODE_REPLICATION_CONSIDERLOAD_KEY, DFS_NAMENODE_REPLICATION_INTERVAL_KEY, DFS_NAMENODE_REPLICATION_MIN_KEY, DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY, DFS_NAMENODE_REPLICATION_MAX_STREAMS_KEY, DFS_NAMENODE_DELEGATION_KEY_UPDATE_INTERVAL_KEY, DFS_NAMENODE_DELEGATION_TOKEN_RENEW_INTERVAL_KEY, DFS_NAMENODE_DELEGATION_TOKEN_MAX_LIFETIME_KEY, DFS_NAMENODE_MAX_COMPONENT_LENGTH_KEY, DFS_NAMENODE_MAX_DIRECTORY_ITEMS_KEY, DFS_NAMENODE_KEYTAB_FILE_KEY, DFS_NAMENODE_USER_NAME_KEY, DFS_NAMENODE_KRB_HTTPS_USER_NAME_KEY

        Configuration that is different for active and standby (Set 2)

        Address configuration:
        FS_DEFAULT_NAME_KEY, DFS_NAMENODE_RPC_ADDRESS_KEY, DFS_NAMENODE_SERVICE_RPC_ADDRESS_KEY, DFS_NAMENODE_HTTP_ADDRESS_KEY, DFS_NAMENODE_HTTPS_ADDRESS_KEY

        Storage directories (external nfs storage dir could be different for active/standby)
        DFS_NAMENODE_NAME_DIR_KEY, DFS_NAMENODE_EDITS_DIR_KEY

        Unused

        DFS_NAMENODE_NAME_DIR_RESTORE_KEY // Will remove this in a separate patch

        Configuration requirements for HA

        Terminology:

        1. NNAddress1, NNAddress2 - address of individual NNs
        2. NNActiveAddress - Where the active provides namenode service.
        3. NNStandbyAddress - Where the standby provides namenode service, such as read-only operations.
        4. NNVIPAddress/FailoverAddress - the VIP address used by the HA setup. This address is owned by the active namenode, and NNActiveAddress is the same as NNVIPAddress.

        Requirements:

        1. Existing deployments must be able to use the configuration without any change.
        2. Datanodes and clients need to know both namenodes, that is, active and standby, either through configuration or a mechanism such as ZooKeeper.
        3. As much as possible, the configuration for all the nodes must be the same. The special configuration required for different node types should be minimal.

        HA solution uses VIP address

        1. System needs to be configured with three sets of addresses, NNVIPAddress, NNAddress1 and NNAddress2.
        2. Clients and datanodes use NNVIPAddress as NNActiveAddress.
        3. To discover NNStandbyAddress, clients and datanodes may try NNAddress1 and NNAddress2 or use a mechanism such as ZooKeeper.

        Active and Standby namenode addresses without VIP address

        1. This setup does not require NNVIPAddress.
        2. To discover NNActiveAddress and NNStandbyAddress, clients and datanodes may try NNAddress1 and NNAddress2 or use a mechanism such as ZooKeeper.

        Proposal

        For VIP based solutions

        1. NNVIPAddress related configuration goes into the common configuration (Set 1 above). I propose using the existing keys: DFS_NAMENODE_RPC_ADDRESS_KEY, DFS_NAMENODE_SERVICE_RPC_ADDRESS_KEY, DFS_NAMENODE_HTTP_ADDRESS_KEY, DFS_NAMENODE_HTTPS_ADDRESS_KEY - try using existing params and add new params for nn1 and nn2.
        2. The active namenode uses this information to start services at appropriate addresses.

        Generic part common to both VIP and non-VIP based solutions

        How do we add both namenodes into a common configuration?
        Datanodes need to know both the namenode addresses. Doing it in a single config file enables this.

        To do this, I propose adding:
        DFS_NAMENODE_IDS (dfs.namenode.ids), a comma-separated list of IDs (any appropriate string). Add the (Set 2) keys suffixed with "." + <NamenodeID>.
        Clients and datanodes can read DFS_NAMENODE_IDS and use the suffix to get the corresponding parameters to load.

        How does namenode know its NamenodeID and what configuration parameters to load?
        The namenode discovers its own configuration from the parameter DFS_NAMENODE_ID (dfs.namenode.id). On namenodes, an xml include points to a file that sets DFS_NAMENODE_ID to the corresponding NamenodeID. On other nodes, such as datanodes and client gateway machines, the xml include points to an empty file.

        Example:

        <property>
          <name>dfs.namenode.ids</name>
          <value>nn1, nn2</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address</name>
          <value>host1:port</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.nn1</name>
          <value>host1:port</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.nn2</name>
          <value>host2:port</value>
        </property>
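
        As a rough illustration of the lookup described above (a sketch only: the helper class and method below are hypothetical and not part of the patch; only the configuration key names follow this proposal):

        import java.util.LinkedHashMap;
        import java.util.Map;
        import org.apache.hadoop.conf.Configuration;

        // Hypothetical helper: read dfs.namenode.ids and, for each NamenodeID,
        // look up the per-namenode RPC address under the suffixed key.
        public class NamenodeAddressResolver {
          public static Map<String, String> rpcAddresses(Configuration conf) {
            Map<String, String> addresses = new LinkedHashMap<String, String>();
            // e.g. dfs.namenode.ids = "nn1, nn2"
            for (String id : conf.getTrimmedStrings("dfs.namenode.ids")) {
              // Set 2 keys are suffixed with "." + NamenodeID, e.g. dfs.namenode.rpc-address.nn1
              String addr = conf.get("dfs.namenode.rpc-address." + id);
              if (addr != null) {
                addresses.put(id, addr);
              }
            }
            return addresses;
          }
        }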
        
        Allen Wittenauer added a comment -

        To discover NNStandbyAddress, clients and datanodes may try NNAddress1 and NNAddress2 or use a mechanism such as ZooKeeper.

        Why? Shouldn't they only care about the VIP in a real HA scenario? What is the value of the VIP if clients are going to be aware of and try both addresses anyway?

        Todd Lipcon added a comment -

        Rather than having to xmlinclude a separate file with the NN's ID, can we allow the user to just use the local hostname? eg set the default to ${env.hostname} or somesuch, and have it get substituted?

        Uma Maheswara Rao G added a comment -

        HA solution uses VIP address
        1. System needs to be configured with three sets of addresses, NNVIPAddress, NNAddress1 and NNAddress2.
        ...
        ....
        3. To discover NNStandbyAddress, clients and datanodes may try NNAddress1 and NNAddress2 or use a mechanism such as ZooKeeper.

        In the VIP solution, putting a dependency on actual IPs may not be a good choice.
        To discover NNStandbyAddress, clients and datanodes need to retry to find the actual standby node address.
        What if we add another VIP address for NNStandbyAddress as well?

        for example:
        Clients and datanodes use standbyNNVIPAddress as NNStandbyAddress.
        So, clients need not bother about retries to find the actual address.

        Please correct me if I have understood wrongly.

        --thanks

        Rajiv Chittajallu added a comment -

        Rather than having to xmlinclude a separate file with the NN's ID, can we allow the user to just use the local hostname? eg set the default to ${env.hostname} or some such, and have it get substituted?

        Physical hosts serving the primary and backup service might change during the lifetime of the cluster. Using hostname(1) or hostid(1) might work, but it is not advisable in production sites, as it might require a configuration refresh when the physical host is replaced.

        Todd Lipcon added a comment -

        Another thought would be to default dfs.namenode.id to "", and if it's empty, try to discover the ID by iterating through all available dfs.namenode.rpc-address.*, checking if any of those addresses is bindable as a local interface. If exactly 1 such IP is found, we can use that ID. If 0 or more than 1 is found, we would bail out due to indeterminate ID.

        My thinking is that right now we can always ship the same configuration to all the hosts in the cluster. Having to add things which differ from machine to machine complicates deployment.
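
        A minimal sketch of this bind-check idea, using plain java.net APIs (illustrative only, not HDFS code; the class and method names are made up):

        import java.net.InetAddress;
        import java.net.NetworkInterface;
        import java.util.Map;

        // Given the per-NamenodeID rpc addresses, return the single ID whose address
        // belongs to a local interface, or null if 0 or more than 1 match (indeterminate).
        public class NamenodeIdDiscovery {
          public static String discover(Map<String, String> rpcAddressById) throws Exception {
            String match = null;
            for (Map.Entry<String, String> entry : rpcAddressById.entrySet()) {
              String hostPort = entry.getValue();
              String host = hostPort.substring(0, hostPort.lastIndexOf(':'));
              InetAddress addr = InetAddress.getByName(host);
              // Non-null only when the address is assigned to a local network interface.
              if (NetworkInterface.getByInetAddress(addr) != null) {
                if (match != null) {
                  return null; // more than one local match
                }
                match = entry.getKey();
              }
            }
            return match; // null when no rpc address is local
          }
        }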

        Suresh Srinivas added a comment -

        Allen - Why? Shouldn't they only care about the VIP in a real HA scenario? What is the value of the VIP if clients are going to be aware of and try both addresses anyway?

        First, given that VIP is interpreted in many ways - the VIP in this jira is a failover IP address. HDFS-1623 neither mandates nor precludes using IP failover. When IP failover is not used, clients will work with both addresses.

        Uma

        Certainly it is worth considering a VIP address for the standby. It makes things much simpler. But as I said in my previous comment, a failover address may not be employed by some setups.

        Todd, Another thought would be to default dfs.namenode.id to "", and if it's empty, try to discover the ID by iterating through all available dfs.namenode.rpc-address.*, checking if any of those addresses is bindable as a local interface. If exactly 1 such IP is found, we can use that ID. If 0 or more than 1 is found, we would bail out due to indeterminate ID.

        I think this is worth considering. However, I think using logical names in the configuration keys rather than physical addresses is a better idea.

        Todd Lipcon added a comment -

        Sure, I agree logical names are preferable. But having an automatic mapping so you can deploy the same hdfs-site.xml to all nodes is better for manageability.

        I used to manage some MySQL active-active setups and it was always a pain to deal with changing server_id on the local config of either side of the [otherwise-identical] pair.

        Allen Wittenauer added a comment -

        I think the general sentiment I'm left with (and I think Rajiv is sort of alluding to) is that based upon this configuration information, the architecture isn't operationally sound and/or we're trying to do too much without a real understanding of how this works in practice.

        It makes no sense to use a failover IP and have the clients know where the failover IP might fail over. Especially keep in mind HDFS-34: what will end up happening is that ops teams will end up having to eat up 5 addresses for 2 HA-NNs. (1 for the failover, 2 for each logical node, 2 for each physical node)

        If we're trying to build both a Failover server and a Scalable service (to use SunCluster terminology), then the configuration options for those are very very different. In a failover scenario, the failover IP should be the only configuration option that clients need. In a Scalable scenario, then the gang of addresses should be the set. In other words, these options are mutually exclusive.

        Suresh Srinivas added a comment -

        I think the general sentiment I'm left with (and I think Rajiv is sort of alluding to) is that based upon this configuration information, the architecture isn't operationally sound and/or we're trying to do too much without a real understanding of how this works in practice.

        I am not sure if you and Rajiv are talking about the same thing! Rajiv is providing feedback on why using hostnames to make specific keys is not such a good idea.

        > It makes no sense to use a failover IP and have the clients know where the failover IP might fail over. Especially keep in mind HDFS-34: what will end up happening is that ops teams will end up having to eat up 5 addresses for 2 HA-NNs. (1 for the failover, 2 for each logical node, 2 for each physical node)
        This is not just a question about clients. You could use the failover address for the active and standby and just use that from the client. This could also apply to datanodes. This might be the solution that some folks go with.

        That second scheme is not to use failover. In such a case, clients need to know where the active is running. Datanodes need to know where both the active and standby are running.

        Allen Wittenauer added a comment -

        That second scheme is not to use failover. In such a case, clients need to know where the active is running. Datanodes need to know where both the active and standby are running.

        I think there is a major terminology disconnect. "failover" implies an active and a standby.

        Suresh Srinivas added a comment -

        > I think there is a major terminology disconnect. "failover" implies an active and a standby.
        Yes. Especially taken out of context. What I am talking about is IP failover.

        Allen Wittenauer added a comment -

        I don't think I took it out of context. You specifically said:

        That second scheme is not to use failover. In such a case, clients need to know where the active is running. Datanodes need to know where both the active and standby are running.

        I read this as "something which is not failover has an active and a standby that a client cares about"....which makes no sense. If a client cares about the active server, it is using a failover architecture, no matter how it is painted.

        Konstantin Shvachko added a comment -

        the architecture isn't operationally sound and/or we're trying to do too much without a real understanding of how this works in practice.

        I think you can say both at this point, since this jira still talks about VIP and non-VIP, shared-storage and no-shared-storage solutions, which should be resolved, as the implementations are very different for these approaches.
        In my prototype HDFS-2064 I did not introduce or change a single config key. Not saying it is good, just that it can be done. And concentrating on one single approach rather than building a universal pluggable HA framework will simplify things.

        Suresh Srinivas added a comment -

        That second scheme is not to use failover. In such a case, clients need to know where the active is running. Datanodes need to know where both the active and standby are running.

        The context of this is that the failover in the above statement is IP failover, not failover from the active to the standby namenode.

        I think you can say both at this point, since this jira still talks about VIP and non-VIP, shared-storage and no-shared-storage solutions, which should be resolved, as the implementations are very different for these approaches.

        I am not sure about your conclusion here, Konstantin. You want to use the backup node approach with IP failover and a specially rejiggered load balancer for dual block reports. Some want a solution with no IP failover and no special load balancer.

        I did not introduce or change a single config key. Not saying it is good, just that it can be done. And concentrating on one single approach rather than building a universal pluggable HA framework will simplify things.

        You are using the backup namenode configuration keys. So it is not that you are making changes; the changes you need are already there. Also, I think that if you do not make changes and overload the backup configuration to mean standby, it is not clean. Various discussions have covered this already.

        Konstantin Shvachko added a comment -

        You want to use the backup node ... Some want a solution with no IP failover and no special load balancer.

        I think Allen's point is: which approach are you taking and implementing? I've asked this many times. It seems that you are building a universal HA framework that would fit multiple approaches. If so, that would be "trying to do too much", as Allen states. E.g., with the IP failover approach you probably don't need any configuration changes.

        if you do not make changes and overload the backup configuration to mean standby, it is not clean.

        "clean" is a subjective metrics.
        The configuration parameters should be the same for Backup and Standby nodes. As these are only different roles of the single NodeNode. If some of existing parameters need to be reasonably renamed I'm for it.

        Rajiv Chittajallu added a comment -

        Terminology:
        1. NNAddress1, NNAddress2 - address of individual NNs
        2. NNActiveAddress - Where the active provides namenode service.
        3. NNStandbyAddress - Where the standby provides namenode service, such as read-only operations.
        4. NNVIPAddress/FailoverAddress - the VIP address used by the HA setup. This address is owned by the active namenode, and NNActiveAddress is the same as NNVIPAddress.

        From the above description, I see only 2 IPs that could be moved between nodes, NNActiveAddress and NNStandbyAddress, and those could possibly be called VIPs.

        HA solution uses VIP address:

        1. System needs to be configured with three sets of addresses, NNVIPAddress, NNAddress1 and NNAddress2.
        2. Clients and datanodes use NNVIPAddress as NNActiveAddress.
        3. To discover NNStandbyAddress, clients and datanodes may try NNAddress1 and NNAddress2 or use a mechanism such as ZooKeeper.

        And the discovery part is probably what is confusing everyone. As Allen said, this seems to be unnecessary. If you can't rely on the VIP address, then the implementation is not correct. Clients don't have to know the physical host address.

        Using ZooKeeper and using IP failover are two different solutions for notifying clients of a namenode failover.

        Suresh Srinivas added a comment -

        I've asked this many times. It seems that you are building a universal HA framework that would fit multiple approaches.

        I have answered it many times. The requirements considered were:

        1. Non-shared vs. shared approach
        2. IP failover vs. no IP failover

        These multiple approaches are considered because some folks (working on this jira) want a solution with no special hardware and no IP failover, and want to use the shared state approach. Your preference and a few other jira posts indicate a preference for using the BackupNode and possibly IP failover.

        If so, that would be "trying to do too much", as Allen states. E.g., with the IP failover approach you probably don't need any configuration changes.

        I see. How do you send block reports from datanodes to two namenodes?

        In the prototype you have:

        1. The backup node configuration is used as a placeholder for the second namenode, with the very confusing situation of the backup node becoming active, etc. We have gone over this before.
        2. All the configuration requirements discussed in this jira have been moved to the load balancer. It is not that you do not have them.

        The configuration parameters should be the same for the Backup and Standby nodes, as these are only different roles of a single NameNode. If some of the existing parameters need to be reasonably renamed, I'm for it.

        The proposal here keeps backward compatibility for configurations using the backup node (not sure if anyone is using it!). A second set of params is added for the second namenode. I am okay with renaming the backup node configurations to do this, if you are confident it will not affect existing setups. Please provide a detailed design.

        Suresh Srinivas added a comment -

        I had used VIP address to mean failover address, which seems to have caused the confusion. Here is the second part rewritten:

        For discussion of existing configuration see the first part of - https://issues.apache.org/jira/browse/HDFS-2231?focusedCommentId=13080279&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13080279

        Configuration requirements for HA

        Terminology:

        1. NNAddress1, NNAddress2 - address of individual NNs. They could be logical addresses.
        2. NNActiveAddress - Address where the active is running. This is one of NNAddress1 or NNAddress2.
        3. NNStandbyAddress - Where the standby is running. This is one of NNAddress1 or NNAddress2.
        4. NNFailoverAddress - this is the address of the active used by HA setups that use IP failover mechanism.

        Requirements:

        1. Backward compatibility: Existing deployments must be able to use the existing configuration without any change.
        2. Datanodes and clients need to know both namenodes through configuration.
        3. As much as possible, the configuration for all the nodes must be the same. The special configuration required for different node types (namenode, datanodes, gateways) should be minimal.

        HA solution uses IP failover

        1. System needs to be configured with three sets of addresses, NNFailoverAddress, NNAddress1 and NNAddress2.
        2. To get to the active namenode, clients use NNFailoverAddress.
        3. To discover NNStandbyAddress, clients and datanodes may use ZooKeeper or try NNAddress1 and NNAddress2.

        Active and Standby namenode addresses without IP failover

        1. This setup does not require NNFailoverAddress.
        2. To discover NNActiveAddress and NNStandbyAddress, clients and datanodes may try NNAddress1 and NNAddress2 or use ZooKeeper.

        Proposal

        For solutions using IP Failover

        1. NNFailoverAddress related configuration goes into the common configuration (Set 1 above). I propose using the existing keys: DFS_NAMENODE_RPC_ADDRESS_KEY, DFS_NAMENODE_SERVICE_RPC_ADDRESS_KEY, DFS_NAMENODE_HTTP_ADDRESS_KEY, DFS_NAMENODE_HTTPS_ADDRESS_KEY

        Generic part common to both VIP and non-VIP based solutions

        How do we add both namenodes into a common configuration?
        Datanodes need to know both namenode addresses. I propose adding:
        DFS_NAMENODE_IDS (dfs.namenode.ids), a comma-separated list of IDs (any appropriate string). Add the (Set 2) keys suffixed with "." + <NamenodeID>.
        Clients and datanodes can read DFS_NAMENODE_IDS and use the suffix to get the corresponding parameters to use.

        How does namenode know its NamenodeID and what configuration parameters to load?
        The namenode discovers its own configuration from the parameter DFS_NAMENODE_ID (dfs.namenode.id). On namenodes, an xml include points to a file that sets DFS_NAMENODE_ID to the corresponding NamenodeID. On other nodes, such as datanodes and client gateway machines, the xml include points to an empty file. I like Todd's proposal, where a namenode that sees an empty or unconfigured DFS_NAMENODE_ID could try binding to each rpc address and, when it succeeds, discover its NamenodeID from the suffix in the config param. (We could drop DFS_NAMENODE_ID altogether.)

        Example for deployments without IP failover:
        NNAddress1 = host1:port
        NNAddress2 = host2:port

        <property>
          <name>dfs.namenode.ids</name>
          <value>nn1, nn2</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.nn1</name>
          <value>host1:port</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.nn2</name>
          <value>host2:port</value>
        </property>
        

        Example for deployments with IP failover:
        NNFailoverAddress = failoverAddress:port
        NNAddress1 = host1:port
        NNAddress2 = host2:port

        <property>
          <name>dfs.namenode.ids</name>
          <value>nn1, nn2</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address</name>
          <value>failoverAddress:port</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.nn1</name>
          <value>host1:port</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.nn2</name>
          <value>host2:port</value>
        </property>
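
        As a rough illustration of the lookup order these examples imply (a hedged sketch, assuming the unsuffixed key carries the failover address; the helper class below is hypothetical, not part of the patch):

        import org.apache.hadoop.conf.Configuration;

        // Prefer the key suffixed with this node's NamenodeID; fall back to the
        // unsuffixed key, which holds the failover address when IP failover is used.
        public class SuffixedKeyLookup {
          public static String resolve(Configuration conf, String key, String namenodeId) {
            if (namenodeId != null && !namenodeId.isEmpty()) {
              String suffixed = conf.get(key + "." + namenodeId);
              if (suffixed != null) {
                return suffixed;
              }
            }
            return conf.get(key);
          }
        }

        For instance, resolve(conf, "dfs.namenode.rpc-address", "nn1") returns host1:port in both examples, while a node with no NamenodeID configured falls back to failoverAddress:port in the IP failover example (and to null in the first example, where no unsuffixed key is set).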
        
        Suresh Srinivas added a comment -

        Allen or Rajiv, any comments on my previous comment?

        Allen Wittenauer added a comment -

        The config needs to be further broken down into what the clients need and what the NNs need.

        Suresh Srinivas added a comment -

        Attached patch with proposed changes and tests.

        Aaron T. Myers added a comment -

        Hey Suresh, patch looks pretty good to me. Two small comments:

        1. Why remove DFSUtil.getNameServiceIdKey, when you could have just implemented this method in terms of DFSUtil.addSuffixes? Doing so will save you having to change all the references to DFSUtil.getNameServiceIdKey that are presently littered throughout the code.
        2. Should probably remove this: "System.out.println("Suresh returning key " + key + " " + keySuffix);", as not all users will know who "Suresh" is.
        Suresh Srinivas added a comment -

        Why remove DFSUtil.getNameServiceIdKey

        With federation, only one suffix per key was possible. With HA, it is a combination of service ID and namenode ID. Given that, getNameServiceIdKey alone no longer makes sense, hence the method is removed.

        as not all users will know who "Suresh" is.

        With log messages like that, that was exactly the problem I was trying to address.

        Aaron T. Myers added a comment -

        Just realized I said over in HDFS-1973 that I would comment on this JIRA about client-side conf changes, but it totally slipped my mind.

        Anyway, the only change to DFSConfigKeys in HDFS-1973 was to introduce "dfs.client.failover.proxy.provider", a prefix that allows one to configure a particular implementation of a FailoverProxyProvider for a given NN logical URI. That should be compatible with the changes you've proposed here, though some of what went into HDFS-1973 will need a little adaptation to take advantage of what you've implemented here.

        My intention in HDFS-1973 was that the various FailoverProxyProvider implementations would be responsible for their own configurations. For example, a ZK-based FailoverProxyProvider might need to know the quorum members. So, the ConfiguredFailoverProxyProvider introduced by HDFS-1973 introduced the config parameter "dfs.ha.namenode.addresses", which is a comma-separated list of actual (not logical) URIs. This is equivalent to the functionality introduced by the pair of configuration options dfs.namenode.ids and dfs.namenode.rpc-address.* in this patch, and I like the design you have here better.

        I think it's fine to commit this as-designed now, and then we can fix up ConfiguredFailoverProxyProvider once this goes in. I've filed HDFS-2418 to take care of that.

        Aaron T. Myers added a comment -

        That is to say, +1 - the latest patch looks good to me.

        Suresh Srinivas added a comment -

        Updated patch with Todd's comments from HDFS-2301 addressed. Also added missing files.

        Suresh Srinivas added a comment -

        Rebased to integrate HDFS-2301 changes.

        Suresh Srinivas added a comment -

        Todd, I have committed the patch. If you have any comments, will address it in a separate jira.


          People

          • Assignee:
            Suresh Srinivas
            Reporter:
            Suresh Srinivas
          • Votes:
            0
            Watchers:
            9
