HBASE-2174

Stop from resolving HRegionServer addresses to names using DNS on every heartbeat

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.4, 0.90.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Over time, many parts of the code have evolved in different ways, and one result is that addresses are handled differently in different parts of the code. We need to set a standard and correct any inconsistencies.

      1. HBASE-2174_0.20.3.patch
        4 kB
        Karthik Ranganathan

        Activity

        Jean-Daniel Cryans created issue -
        Jean-Daniel Cryans added a comment -

        One example of weirdness is when the region server is told which address to use according to the master:

        INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us address to use. Was=10.10.21.16:60020, Now=10.10.21.16
        
        Jean-Daniel Cryans added a comment -

        Another thing people sometimes see is that their region servers have 2-3 znodes registered in ZooKeeper. I wasn't able to figure out what the problem was right away, but the issue exists and it's surely in the scope of this jira.

        Kannan Muthukkaruppan added a comment -

        JD: The cluster coming down when DNS is flaky (the issue we reported on hbase-dev yesterday) is basically a dup of HBASE-1679. It causes the master to send a "startup" message to an already running region server. This then leads to the region server creating new znodes in zookeeper (of the form /hbase/rs/<startcode>).

        Kannan Muthukkaruppan added a comment -

        To fill in some more details, this is what we think happened during DNS flakiness:

        A region server periodically sends a regionServerReport (RPC call) to the master. An HServerInfo is passed as an argument, and it identifies the sending region server in IP address format.

        The master, in the ServerManager class, maintains a serversToServerInfo map which is hostname based. Every time the master receives a regionServerReport, it converts the IP address based name to a hostname via the info.getServerName() call. Normally this call returns the hostname, but we suspect that during the DNS flakiness it returned an IP address based string. This caused ServerManager.java to think that it was hearing from a new server, which led to:

        HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
        if (storedInfo == null) {
          if (LOG.isDebugEnabled()) {
            LOG.debug("Received report from unknown server -- telling it " +   <<============
              "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());      <<============
          }

        and bad things happened down the road (such as the region server registering itself multiple times in Zookeeper, cluster coming down, etc.).

        The above error message in our logs (example below) indeed identified the host in IP address syntax, even though normally the getServerName call would return the info in hostname format.

        2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210

        Perhaps all we need to do is to change the ServerManager's internal maps to all be IP based? That way we avoid/bypass the master having to look up the hostname on every heartbeat.
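
The failure mode described in this comment can be reproduced in a few lines. The sketch below is a minimal simulation, not the ServerManager code: `HeartbeatKeySketch`, `reverseLookup`, `serverName`, `heartbeatRecognized`, and the hostname `rs1.example.com` are hypothetical, and only the IP, port, and start code are taken from the log line above. It registers a server while the simulated DNS is healthy, then checks a heartbeat that arrives while lookups fail and fall back to the IP string.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative simulation only; none of these names come from HBase. */
public class HeartbeatKeySketch {
    static final String IP = "10.129.68.203";
    static final int PORT = 60020;
    static final long START_CODE = 1263605543210L;

    /** Simulated reverse DNS: falls back to the IP string when the lookup fails. */
    public static String reverseLookup(String ip, boolean dnsHealthy) {
        return dnsHealthy ? "rs1.example.com" : ip; // hypothetical hostname
    }

    /** Builds a "host,port,startcode" key, the shape seen in the log line. */
    public static String serverName(String host, int port, long startCode) {
        return host + "," + port + "," + startCode;
    }

    /** Returns true if a heartbeat sent during a DNS outage still matches the registered key. */
    public static boolean heartbeatRecognized(boolean resolveOnEveryHeartbeat) {
        Set<String> knownServers = new HashSet<>();

        // Registration happens while DNS is healthy.
        String hostAtStartup = resolveOnEveryHeartbeat ? reverseLookup(IP, true) : IP;
        knownServers.add(serverName(hostAtStartup, PORT, START_CODE));

        // The heartbeat arrives while DNS is failing.
        String hostAtHeartbeat = resolveOnEveryHeartbeat ? reverseLookup(IP, false) : IP;
        return knownServers.contains(serverName(hostAtHeartbeat, PORT, START_CODE));
    }

    public static void main(String[] args) {
        System.out.println("hostname key, resolved per heartbeat: " + heartbeatRecognized(true));
        System.out.println("IP key, no lookup: " + heartbeatRecognized(false));
    }
}
```

With per-heartbeat resolution the key changes underneath the master and the map lookup misses, which corresponds to the "unknown server" branch quoted above. With an IP-based key the same heartbeat still matches.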

        ryan rawson added a comment -

        I'm wondering if we should do something radical and flexible in this area...

        Right now, if we use IP addresses, we can make it hard for clients to talk to HBase: if a client's knowledge of HBase's IP differs from the one HBase uses, then bam. Think: EC2 internal/external addresses.

        Maybe it makes sense to always use symbolic names and resolve them at the client when they need to make RPC calls?
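
As a sketch of what client-side resolution could look like in Java, the standard `java.net.InetSocketAddress` can carry a name without resolving it (`createUnresolved`), leaving the lookup to whoever opens the connection. The names `ClientSideResolveSketch`, `advertise`, and `resolveAtClient` are hypothetical, and this is not a description of how HBase stores addresses.

```java
import java.net.InetSocketAddress;

/** Illustrative only: defer name resolution to the client that makes the RPC. */
public class ClientSideResolveSketch {

    /** Server side: publish the symbolic name as-is, with no DNS lookup. */
    public static InetSocketAddress advertise(String hostname, int port) {
        return InetSocketAddress.createUnresolved(hostname, port);
    }

    /** Client side: resolve with the client's own resolver just before connecting. */
    public static InetSocketAddress resolveAtClient(InetSocketAddress advertised) {
        // This constructor attempts resolution; if it fails, the result stays unresolved.
        return new InetSocketAddress(advertised.getHostString(), advertised.getPort());
    }

    public static void main(String[] args) {
        InetSocketAddress advertised = advertise("localhost", 60020);
        InetSocketAddress usable = resolveAtClient(advertised);
        System.out.println("advertised unresolved: " + advertised.isUnresolved());
        System.out.println("client view: " + usable);
    }
}
```

Because the name is resolved by each client, an internal and an external client can legitimately end up with different IPs for the same advertised name, which is the EC2 case mentioned above.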

        Joydeep Sen Sarma added a comment -

        Yeah, hostnames are much more flexible. We can review what HDFS does; it already works (for these reasons) in AWS (I was once able to get my home PC to talk directly to HDFS nodes inside AWS).

        Hadoop also supports a configurable hostname per machine (via a conf file); we saw this when we were comparing the HBase and Hadoop hostname-related code. There's probably some interesting use case for that as well.

        Karthik Ranganathan added a comment -

        @Joydeep/Ryan - good point about EC2. They expose an IP to the outside world (the elastic IP) that is different from the machine's actual IP. We should definitely stick to hostnames.

        I guess we should just stop the master from resolving each time and take what the client says its hostname-IP pair is as the truth. ZooKeeper currently stores the IP of the root region server; not sure if that's an issue.

        Kannan Muthukkaruppan added a comment -

        Also, in '.META.', the region assignments are stored as IP addresses. Wondering if that could present some "upgrade/compatibility" issues if everything was changed to be hostname based.

        Karthik Ranganathan added a comment -

        Add a fix on the HMaster side to stop it from resolving HRegionServer addresses to names using DNS.
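
The attached patch is not reproduced on this page. As one illustration of avoiding a lookup on every heartbeat, the sketch below resolves a region server's address once and reuses the cached name afterwards. `CachedNameSketch`, its methods, and the hostname are hypothetical, and the resolver is injected so the example runs without real DNS; the actual fix may work differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Illustrative only: resolve each address once, then reuse the cached name. */
public class CachedNameSketch {
    private final Map<String, String> hostnameByIp = new HashMap<>();
    private final UnaryOperator<String> resolver; // stand-in for a DNS lookup
    private int lookups = 0;

    public CachedNameSketch(UnaryOperator<String> resolver) {
        this.resolver = resolver;
    }

    /** Returns "host,port,startcode"; the resolver runs only on the first call per IP. */
    public String serverName(String ip, int port, long startCode) {
        String host = hostnameByIp.computeIfAbsent(ip, key -> {
            lookups++;
            return resolver.apply(key);
        });
        return host + "," + port + "," + startCode;
    }

    /** Number of times the resolver was actually invoked. */
    public int lookups() {
        return lookups;
    }

    public static void main(String[] args) {
        CachedNameSketch names = new CachedNameSketch(ip -> "rs1.example.com"); // hypothetical
        for (int heartbeat = 0; heartbeat < 1000; heartbeat++) {
            names.serverName("10.129.68.203", 60020, 1263605543210L);
        }
        System.out.println("resolver calls: " + names.lookups());
    }
}
```

The cache in this sketch never expires, so a later DNS failure cannot change the key the master uses for an already registered server; the trade-off is that a genuine hostname change is not picked up until the entry is dropped.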

        Karthik Ranganathan made changes -
        Field Original Value New Value
        Attachment HBASE-2174_0.20.3.patch [ 12436659 ]
        Karthik Ranganathan added a comment -

        Hey guys, I have attached a patch for this issue - it's a very simple fix on the HMaster side. Do let me know if you have any comments.

        Andrew Purtell added a comment -

        +1 on patch, tests running now, will commit if ok.

        stack added a comment -

        Patch looks good. I think, though, that if this patch goes in, to be safe we should require a restart of HBase when upgrading between patch versions. So, if we're talking about supporting a rolling restart in 0.20.4, this would go into a 0.20.5.... but it's looking like the hbase-2248 fix needs to go into 0.20.4, and if so, it dictates a restart.

        Andrew Purtell added a comment -

        Tests look good. Commit to branch now or wait?

        stack added a comment -

        I made a 0.20.5 and moved it here for now. If 0.20.4 requires restart, we'll pull it back in to 0.20.4. Thanks Andrew.

        stack made changes -
        Fix Version/s 0.20.5 [ 12314800 ]
        Fix Version/s 0.21.0 [ 12313607 ]
        stack made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop Flags [Reviewed]
        stack added a comment -

        Changed subject to match the posted patch. Will open new issue to take up where this patch leaves off to finish the review of hbase to ensure we use hostnames everywhere...

        stack made changes -
        Summary "Review how we handle addresses in HBase" -> "Stop from resolving HRegionServer addresses to names using DNS on every heartbeat"
        stack added a comment -

        Applied to branch and trunk. Thanks for the patch, Karthik.

        stack made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.21.0 [ 12313607 ]
        Resolution Fixed [ 1 ]
        stack made changes -
        Assignee Karthik Ranganathan [ karthik.ranga ]
        Jonathan Gray added a comment -

        This was already committed to the branch, which will be 0.20.4, not 0.20.5 (just updating the fix version for clarity).

        Jonathan Gray made changes -
        Fix Version/s 0.20.4 [ 12314496 ]
        Fix Version/s 0.20.5 [ 12314800 ]
        stack made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                     Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Patch Available        25d 1h                  1                 stack           23/Feb/10 18:37
        Patch Available -> Resolved    16d 1h 30m              1                 stack           11/Mar/10 20:07
        Resolved -> Closed             945d 10h 7m             1                 stack           12/Oct/12 07:14

          People

          • Assignee:
            Karthik Ranganathan
            Reporter:
            Jean-Daniel Cryans
          • Votes:
            0
            Watchers:
            8
