HBase
HBASE-2940

Improve behavior under partial failure of region servers

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: master, regionserver
    • Labels: None

      Description

      On larger clusters, we often see failure cases where a server is "up" (ie heartbeating) but unable to actually service requests properly (or at a reasonable speed). This can happen for any number of reasons including:

      • failing disks or disk controllers respond, but do so very slowly
      • the machine is swapping, so everything is still running but much more slowly than expected
      • HBase or the DN on the machine has been misconfigured (eg missing lzo libs) so it fails to correctly open regions, perform flushes, etc.

      Here are a few proposed features that are worth considering:
      1) Add a "blacklist" or "remote shutdown" functionality to the master. This is useful if the region server is up but for some reason the admin can't ssh in to shut it down (eg the root disk has failed). This feature would allow the admin to issue a command that will shut down any given RS.
      2) Periodically run a "health check" script on the region server node. If the script returns an error code, the RS could shut itself down gracefully and report an error message on the master console (a rough sketch of such a chore follows this list).
      3) Allow clients to report back RS-specific errors to the master. This would be useful for monitoring, and we could add heuristics to automatically shut down region servers if they have an elevated error count over some period of time.
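
      As an illustration of proposal 2, a hypothetical chore on the region server could run an operator-supplied script on a schedule and shut the RS down when the script fails. This is only a sketch: the script path and the gracefulStop callback are illustrative placeholders, not an existing HBase API.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical health-check chore: runs an operator-supplied script on a
// schedule and asks the region server to stop itself if the script fails.
public class HealthCheckChoreSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final String scriptPath;     // e.g. an admin-provided healthcheck.sh (hypothetical)
  private final Runnable gracefulStop; // callback that triggers a graceful RS shutdown

  public HealthCheckChoreSketch(String scriptPath, Runnable gracefulStop) {
    this.scriptPath = scriptPath;
    this.gracefulStop = gracefulStop;
  }

  /** Run the script every intervalSeconds; a non-zero exit code stops the RS. */
  public void start(long intervalSeconds) {
    scheduler.scheduleAtFixedRate(() -> {
      try {
        Process p = new ProcessBuilder(scriptPath).start();
        if (p.waitFor() != 0) {
          gracefulStop.run();
        }
      } catch (Exception e) {
        // If the script cannot even be launched, treat that as a failed check.
        gracefulStop.run();
      }
    }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }
}
{code}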

        Activity

        Todd Lipcon created issue -
        ryan rawson added a comment -

        I think the primary mechanism of shutdown/termination should be via the HLog block: the master should close the log file, then reassign regions. Since the HLog is gone, any operations that were succeeding would terminate, and the reassignment would prevent new clients from talking to the dead server.
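
        A minimal sketch of the fencing idea Ryan describes, assuming the master knows the path of the wedged server's WAL file. DistributedFileSystem.recoverLease is a real HDFS client call, but this is only an illustration, not HBase's actual shutdown path.

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch of master-side WAL fencing: recover the lease on the region server's
// log file so a wedged server can no longer append to it.
public class WalFenceSketch {
  public static boolean fenceWal(Configuration conf, Path walFile) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IOException("WAL fencing requires HDFS");
    }
    // recoverLease returns true once the file is closed; a caller would
    // typically retry until lease recovery completes.
    return ((DistributedFileSystem) fs).recoverLease(walFile);
  }
}
{code}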


        Ted Yu added a comment -

        Since hbase.rootdir points to the Hadoop NameNode, the HBase Master can poll Hadoop for the live data nodes. If a data node is down for longer than a specified duration and an RS happens to be on the same server, the Master can blacklist that RS (assuming there is a problem with the heartbeat from that RS in the same time period).
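
        A rough sketch of the polling Ted suggests, assuming the master reuses the FileSystem behind hbase.rootdir. getDataNodeStats is a real HDFS client call, though depending on the Hadoop version the report can include dead nodes; the actual blacklisting decision is left out.

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Sketch: ask the NameNode which DataNodes it knows about, so the master could
// cross-check region server hosts against them.
public class DataNodePollSketch {
  public static Set<String> dataNodeHosts(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Set<String> hosts = new HashSet<String>();
    if (fs instanceof DistributedFileSystem) {
      for (DatanodeInfo dn : ((DistributedFileSystem) fs).getDataNodeStats()) {
        // Depending on the Hadoop version this report may include dead nodes;
        // a real check would also look at last-contact time.
        hosts.add(dn.getHostName());
      }
    }
    return hosts;
  }
}
{code}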

        Jonathan Gray added a comment -

        I like this direction. We're starting to use HBCK fairly heavily as a periodic health-check script and there were some ideas about some kind of basic read/write verification test on each RS as part of it.

        @Todd, #1 sounds good. There has been talk from the ops guys here about having a separate file with a list of blacklisted RSs (there's something like this in Hadoop, I believe), so you can mark nodes as under maintenance or blacklisted for reasons related to this JIRA. #2, see above. Definitely an RS sanity check would be nice (can you append to the log, can you do a basic read/write, etc.). #3, interesting. Need to think on that more.

        @Ryan, that seems like a secondary mechanism for shutting down an RS because it will always require log replay. Though the reasons above might require a forceful external abort, if the RS is responsive, we should do a controlled shutdown so regions can be flushed. If it takes too long or RS is unresponsive, then using the HLog sounds like a good strategy. Need to be sure whatever properties of hdfs appends we're using will not change between the 0.20 implementation and the 0.21 and later one.

        @Ted, HBase is far better at determining live/dead nodes than the NN (zk vs 3 minute timeout heartbeats), so I wouldn't expect this to be a big win. It's also an open question whether you would always want an RS with a dead DN on the same machine to also go down. Maybe there is a situation where this would be useful information to have in HBase but need to think on it more. In most instances, if there is a problem with the node and we want HBase to proactively kill an RS, we would know in HBase-land.
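
        For the basic read/write sanity check mentioned above, a toy probe against the Java client API might look like the sketch below. The "canary" table, family, and qualifier names are made up for illustration, and this is not HBCK itself.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Toy read/write probe: write a cell to a probe table and read it back.
public class CanaryProbeSketch {
  public static boolean readWriteOk() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "canary"); // hypothetical probe table
    byte[] row = Bytes.toBytes("probe-" + System.currentTimeMillis());
    byte[] family = Bytes.toBytes("f");
    byte[] qualifier = Bytes.toBytes("q");
    byte[] value = Bytes.toBytes("ok");
    try {
      Put put = new Put(row);
      put.add(family, qualifier, value); // write a cell...
      table.put(put);
      Result result = table.get(new Get(row)); // ...then read it back
      return Bytes.equals(value, result.getValue(family, qualifier));
    } finally {
      table.close();
    }
  }
}
{code}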

        Andrew Purtell added a comment -

        Superseded by the MTTR constellation of issues.

        Andrew Purtell made changes -
        Resolution: Done
        Status: Open → Resolved

          People

          • Assignee: Unassigned
          • Reporter: Todd Lipcon
          • Votes: 1
          • Watchers: 7
