HBase / HBASE-2940

Improve behavior under partial failure of region servers


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: master, regionserver
    • Labels: None

    Description

      On larger clusters, we often see failure cases where a server is "up" (i.e., still heartbeating) but unable to service requests properly, or at a reasonable speed. This can happen for any number of reasons, including:

      • failing disks or disk controllers that still respond, but do so very slowly
      • the machine is swapping, so everything is still running but much more slowly than expected
      • HBase or the DN on the machine has been misconfigured (e.g., missing LZO libs), so it fails to correctly open regions, perform flushes, etc.

      Here are a few proposed features worth considering:
      1) Add "blacklist" or "remote shutdown" functionality to the master. This is useful when the region server is up but, for some reason, the admin can't SSH in to shut it down (e.g., the root disk has failed). This feature would allow the admin to issue a command that shuts down any given RS.
      2) Periodically run a "health check" script on the region server node. If the script returns an error code, the RS could shut itself down gracefully and report an error message on the master console.
      3) Allow clients to report RS-specific errors back to the master. This would be useful for monitoring, and we could add heuristics that automatically shut down region servers whose error count stays elevated over some period of time.
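
      As a rough illustration of proposal 2, the sketch below runs an external check command once, treats a non-zero exit code or a timeout as a failure, and calls a shutdown hook after several consecutive failures. It is only a sketch under stated assumptions, not how HBase implements anything: the names `run_health_check`, `HealthMonitor`, `max_failures`, and `on_unhealthy` are hypothetical, and requiring consecutive failures is an added assumption to avoid reacting to a single transient error.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HealthCheckResult:
    healthy: bool
    message: str


def run_health_check(command: List[str], timeout_s: float) -> HealthCheckResult:
    """Run the check once; a non-zero exit code or a timeout counts as unhealthy."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung script is itself a symptom (e.g. a disk that barely responds).
        return HealthCheckResult(False, f"health check timed out after {timeout_s}s")
    except OSError as exc:
        return HealthCheckResult(False, f"could not run health check: {exc}")
    if proc.returncode != 0:
        detail = (proc.stderr or proc.stdout).strip()
        return HealthCheckResult(False, f"exit code {proc.returncode}: {detail}")
    return HealthCheckResult(True, "ok")


def log_unhealthy(message: str) -> None:
    """Placeholder hook (hypothetical); a real RS would begin a graceful shutdown."""
    print(f"region server unhealthy: {message}", file=sys.stderr)


class HealthMonitor:
    """Counts consecutive failures and fires the hook once the limit is reached."""

    def __init__(
        self,
        command: List[str],
        timeout_s: float,
        max_failures: int,
        on_unhealthy: Callable[[str], None] = log_unhealthy,
    ) -> None:
        self.command = command
        self.timeout_s = timeout_s
        self.max_failures = max_failures
        self.on_unhealthy = on_unhealthy
        self.consecutive_failures = 0

    def check_once(self) -> bool:
        """Run one check; return True if healthy. Call this on a periodic timer."""
        result = run_health_check(self.command, self.timeout_s)
        if result.healthy:
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            # Fires on every failing check from the limit onward.
            self.on_unhealthy(result.message)
        return False
```

      The timeout matters as much as the exit code here: on a machine that is swapping or has a failing disk, the check script may never return at all, so a hung check is counted as a failure rather than waited on.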
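
      One possible shape for the heuristic in proposal 3 is a sliding-window count of client-reported errors per region server, kept on the master. The sketch below is an assumption-laden illustration rather than a proposed design: `ErrorReportTracker`, `window_s`, and `threshold` are hypothetical names, timestamps are passed in explicitly and assumed to be non-decreasing, and a real heuristic would likely also need to account for request volume so that a busy but healthy server is not flagged.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class ErrorReportTracker:
    """Master-side tally of client-reported errors per region server (hypothetical)."""

    def __init__(self, window_s: float, threshold: int) -> None:
        self.window_s = window_s    # how far back reports are counted, in seconds
        self.threshold = threshold  # reports within the window that mark a suspect
        self._reports: Dict[str, Deque[float]] = defaultdict(deque)

    def report_error(self, server: str, now: float) -> None:
        """Record one client-reported error against `server` at time `now`."""
        self._reports[server].append(now)

    def error_count(self, server: str, now: float) -> int:
        """Number of reports for `server` within the last `window_s` seconds."""
        reports = self._reports[server]
        # Drop reports that have aged out of the window.
        while reports and now - reports[0] > self.window_s:
            reports.popleft()
        return len(reports)

    def suspects(self, now: float) -> List[str]:
        """Servers whose recent error count has reached the threshold."""
        return sorted(
            server
            for server in list(self._reports)
            if self.error_count(server, now) >= self.threshold
        )
```

      For example, with `window_s=60` and `threshold=3`, a server that receives three reports within one minute shows up in `suspects()`, and drops out again once the oldest report ages past the window. Whether a suspect is merely surfaced for monitoring or actually shut down is a policy decision left open here.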


          People

            Assignee: Unassigned
            Reporter: Todd Lipcon (tlipcon)
            Votes: 1
            Watchers: 7
