HBase / HBASE-21146

(2.0) Add ability for HBase Canary to ignore a configurable number of ZooKeeper down nodes


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0, 3.0.0-alpha-1, 2.0.0
    • Fix Version/s: 2.0.4
    • Component/s: canary, Zookeeper
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Adds -permittedZookeeperFailures <N>, which makes the Canary keep running and report on downed ZooKeeper ensemble members rather than exit.
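The release note names only the flag and its numeric argument. The following is a minimal sketch of how such an option might be read from a plain String[] argument array. It assumes a default of 0 when the flag is absent (matching the strict behaviour described below) and rejects negative values. The class and method names are hypothetical, and this is not HBase's actual Canary argument parser.

```java
// Hypothetical sketch: this is NOT HBase's actual Canary argument parser.
class PermittedFailuresArgSketch {

    // Assumption: when the flag is absent, no ZooKeeper failures are tolerated,
    // which matches the strict behaviour described in the issue.
    static int parsePermittedZookeeperFailures(String[] args) {
        int permitted = 0;
        for (int i = 0; i < args.length; i++) {
            if (!"-permittedZookeeperFailures".equals(args[i])) {
                continue; // other Canary options are ignored in this sketch
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("-permittedZookeeperFailures needs a numeric argument");
            }
            try {
                permitted = Integer.parseInt(args[++i]); // consume the value after the flag
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("-permittedZookeeperFailures needs a numeric argument", e);
            }
            if (permitted < 0) {
                throw new IllegalArgumentException("-permittedZookeeperFailures must be >= 0");
            }
        }
        return permitted;
    }

    public static void main(String[] args) {
        String[] example = {"-zookeeper", "-treatFailureAsError", "-permittedZookeeperFailures", "1"};
        System.out.println(parsePermittedZookeeperFailures(example)); // prints 1
    }
}
```

Validating the value at parse time means a typo such as a missing number fails fast, instead of silently falling back to the strict default.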

    Description

      When running org.apache.hadoop.hbase.tool.Canary with args -zookeeper -treatFailureAsError, the Canary will try to get a znode from each ZooKeeper server in the ensemble. If any server is unavailable or unresponsive, the canary will exit with a failure code.

      If we use the Canary to gauge server health, and alert accordingly, this can be too strict. For example, in a 5-node ZooKeeper cluster, having one node down is safe and expected in rolling upgrades/patches.

      This is a request to allow the Canary to take an additional parameter:

      -permittedZookeeperFailures <N>

      If N=1, in the 5-node ZooKeeper ensemble example, then the Canary will still pass if 4 ZooKeeper nodes are reachable, but fail if 3 or fewer are reachable.
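The requested rule amounts to comparing the number of unreachable ensemble members against N. The Java sketch below illustrates it under stated assumptions. The class and method names, the host names, and the canReadZnode predicate (a stand-in for the Canary's real per-server znode read) are all hypothetical. This is not the code from the attached patches.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the pass/fail rule requested in this issue; not HBase code.
class ZkEnsembleThresholdSketch {

    // Pass when the number of unreachable members is at most permittedFailures.
    static boolean passes(int ensembleSize, int reachable, int permittedFailures) {
        if (reachable < 0 || reachable > ensembleSize || permittedFailures < 0) {
            throw new IllegalArgumentException("invalid counts");
        }
        return ensembleSize - reachable <= permittedFailures;
    }

    // Probe every member (instead of stopping at the first failure) so that each
    // downed node is reported, then apply the threshold.
    // canReadZnode is a hypothetical stand-in for the real per-server znode read.
    static boolean checkEnsemble(List<String> hosts, Predicate<String> canReadZnode, int permittedFailures) {
        int reachable = 0;
        for (String host : hosts) {
            if (canReadZnode.test(host)) {
                reachable++;
            } else {
                System.err.println("ZooKeeper member unreachable: " + host);
            }
        }
        return passes(hosts.size(), reachable, permittedFailures);
    }

    public static void main(String[] args) {
        // The 5-node, N=1 example from the description.
        System.out.println(passes(5, 4, 1)); // true
        System.out.println(passes(5, 3, 1)); // false
        // N=0 reproduces the original strict behaviour: any failure fails the run.
        System.out.println(passes(5, 4, 0)); // false

        // Hypothetical host names; zk3 is simulated as down.
        List<String> hosts = List.of("zk1:2181", "zk2:2181", "zk3:2181", "zk4:2181", "zk5:2181");
        System.out.println(checkEnsemble(hosts, h -> !h.equals("zk3:2181"), 1)); // true
    }
}
```

Counting reachable members and only then applying the threshold is what lets the Canary keep running and report every downed node. With N=0 the same check degenerates to the original all-or-nothing behaviour.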

      (This is my first Jira posting... sorry if I messed anything up.)

      Attachments

        1. HBASE-21126.branch-1.001.patch (11 kB, Josh Elser)
        2. HBASE-21126.master.001.patch (11 kB, Josh Elser)
        3. HBASE-21126.master.002.patch (11 kB, Josh Elser)
        4. HBASE-21126.master.003.patch (11 kB, Josh Elser)
        5. HBASE-21146.branch-2.0.001.patch (11 kB, David Manning)
        6. zookeeperCanaryLocalTestValidation.txt (37 kB, Josh Elser)


          People

            Assignee: David Manning (dmanning)
            Reporter: David Manning (dmanning)
            Votes: 0
            Watchers: 4

              Time Tracking

                Original Estimate: 48h
                Remaining Estimate: 48h
                Time Spent: Not Specified
