HBASE-4393

Implement a canary monitoring program

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.94.0, 0.95.0
    • Component/s: monitoring
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Tool to check cluster. See $ ./bin/hbase org.apache.hadoop.hbase.tool.Canary -help for how to use.
    • Tags:
      0.96notable

      Description

      This JIRA is to implement a standalone program that can be used to do "canary monitoring" of a running HBase cluster. This program would gather a list of the regions in the cluster, then iterate over them doing lightweight operations (eg short scans) to provide metrics about latency as well as alert on availability issues.
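      The loop the description sketches — list the regions, probe each with a lightweight read, record latency, and flag unavailable regions — can be outlined in plain Java. This is a hedged sketch only: the region list and the read are stand-ins (a `Predicate` here), not the real HBase client API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class CanaryLoopSketch {
    // Probe each region with a lightweight read; record per-region latency in ms,
    // or -1 when the read fails (an availability problem).
    static Map<String, Long> probe(List<String> regions, Predicate<String> lightweightRead) {
        Map<String, Long> latencies = new LinkedHashMap<>();
        for (String region : regions) {
            long start = System.nanoTime();
            boolean ok = lightweightRead.test(region); // e.g. a short scan or Get
            long ms = (System.nanoTime() - start) / 1_000_000;
            latencies.put(region, ok ? ms : -1L);
        }
        return latencies;
    }

    public static void main(String[] args) {
        // Pretend region "t1,r2" is unreachable.
        System.out.println(probe(List.of("t1,r1", "t1,r2"), r -> !r.equals("t1,r2")));
    }
}
```

      The real tool would obtain the region list from the cluster and use a short scan or Get per region; the shape of the loop is the same.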

      1. HBASE-4393-v0.patch
        8 kB
        Matteo Bertozzi
      2. Canary-v0.java
        8 kB
        Matteo Bertozzi
      3. HBaseCanary.java
        4 kB
        Matteo Bertozzi

        Activity

        Otis Gospodnetic added a comment -

        Todd, where/how would the metrics be published? JMX perhaps? Please see my comment on HBASE-4147: https://issues.apache.org/jira/browse/HBASE-4147?focusedCommentId=13104623&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13104623
        Nicolas Spiegelberg added a comment -

        What sort of benefit do you see with doing lightweight operations versus something like RPC sampling?

        Todd Lipcon added a comment -

        RPC sampling on the server side won't tell you if, for example, one of the servers in the cluster has a faulty NIC and thus is dropping packets and has very high latency. The latency "inside" the server will be fast, but for any clients, it will be slow.

        Availability-wise, we sometimes have clusters which only sporadically see access (eg from an MR job that runs every hour). In that case, it's nice to have a canary monitor to determine if one of the region servers is having issues before the job runs and times out. We often find out about these kind of issues from a job failing, instead of proactively from monitoring, since all of the servers are "up", just one region in some kind of broken state.

        Jeff Bean added a comment -

        I'm working on a simple canary that tries to fetch a row from each column family at each region. I'll post it after a successful test.

        Matteo Bertozzi added a comment -

        I've attached a simple draft canary tool that, for each table (or for the specified tables), tries to fetch a row from each region server, and collects and prints failures and times.

        Should this tool be a service that collects/exposes stats for each region/column family, or just a tool to get an idea of the cluster state?

        In case this should be just a tool, any idea on the output format and the metrics that we want to collect and output?

        Lars Hofhansl added a comment -

        @Matteo: Ideally this could be used for trending. So output that is suitable for Ganglia or OpenTSDB (whatever that means in both cases) would be cool.
        Even just a cluster state tool is great.

        Lars Hofhansl added a comment -

        Java code looks great. Maybe instead of using a scanner in sniffRegion, you could use a Get?

        Lars Hofhansl added a comment -

        I would like to get this into 0.94.
        This needs some usage description so that folks can find out what they are supposed to pass on the command line.

        stack added a comment -

        I wrote a note to suggest this NOT be added to 0.94 because it's only basic and it's apart from hbase, so we shouldn't have to hold up hbase to get it in. I was also going to talk about this tool being too basic – emissions are on stdout only rather than up in jmx, formatted as json or whatever – but then I thought we have to start somewhere. We can add to this basic tool later.

        The class needs a license and a class comment.

        Should be called Canary rather than HBaseCanary.

        Put it into a package. Would suggest we start a tool package so o.a.h.h.tool.

        Should implement Tool and be run using ToolRunner. Tool adds a little useful util.

        Needs usage as per lars.

        Could be added to bin/hbase as 'canary' – could start/stop it like we start/stop region. If you do this, then things like log name and location will be set up for you as it is for rest server and thrift server etc.

        Should output be via LOG rather than stdout? Then we can hook its output up variously.

        Skip formatting in the output...the ' - Region ..' i.e. remove the ' - ' prefix.

        I think make the few small changes above and we'd have a good start. Thanks lads. Good stuff
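        The review above asks for a usage message and for the class to follow the Hadoop Tool pattern (implement Tool, launch via ToolRunner). A plain-Java sketch of that shape, with stand-in names and no Hadoop dependency — the real class implements org.apache.hadoop.util.Tool, whose run(String[]) returns the exit code:

```java
public class CanaryToolSketch {
    // Parse minimal flags, print usage on -help, otherwise run one canary pass.
    // Mirrors the Tool.run(String[]) contract: return value is the exit code.
    static int run(String[] args) {
        if (args.length > 0 && "-help".equals(args[0])) {
            System.out.println("Usage: Canary [-help] [-interval <ms>] [table...]");
            return 0;
        }
        // ... gather regions and probe them here ...
        return 0;
    }

    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```

        With the real Tool/ToolRunner, generic Hadoop options (-conf, -D, etc.) are parsed for free before run() is invoked, which is the "little useful util" stack refers to.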

        Lars Hofhansl added a comment -

        You are more iron-handy than me, stack. Your points are well taken, unscheduling.

        stack added a comment -

        Hmm... this thing comes up, reports, and goes down immediately, so some of my suggestions above may be OTT. So I don't think we need the following to get the script in (we can add it later):

        "Could be added to bin/hbase as 'canary' – could start/stop it like we start/stop region. If you do this, then things like log name and location will be set up for you as it is for rest server and thrift server etc."

        Matteo Bertozzi added a comment -

        I've attached a new version of the canary following stack's comments.

        Now the canary is a tool, and has a command line with a couple of options.

        In this implementation the canary runs "forever" and has a pluggable sink interface to collect and output failures and read latencies.

        At the moment the only sink implemented is the FileSink, which allows using a file or stdout as the output device. But we can add support for the hadoop metrics later.
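        The pluggable sink idea can be sketched in plain Java. All names here are illustrative (the actual interface in the patch may differ); the point is the separation between probing and reporting, with a stdout-backed sink standing in for the FileSink/LOG sink discussed in the thread:

```java
import java.util.ArrayList;
import java.util.List;

public class SinkSketch {
    // Pluggable output target for canary observations (illustrative names).
    interface Sink {
        void publishReadTiming(String region, String columnFamily, long msTime);
        void publishReadFailure(String region, String reason);
    }

    // Collects formatted lines and writes them to stdout; a metrics- or
    // LOG-backed Sink would implement the same interface.
    static class StdOutSink implements Sink {
        final List<String> lines = new ArrayList<>();

        public void publishReadTiming(String region, String cf, long ms) {
            emit(String.format("read from region %s column family %s in %dms", region, cf, ms));
        }

        public void publishReadFailure(String region, String reason) {
            emit("read from region " + region + " failed: " + reason);
        }

        private void emit(String msg) {
            lines.add(msg);
            System.out.println(msg);
        }
    }

    public static void main(String[] args) {
        StdOutSink sink = new StdOutSink();
        sink.publishReadTiming("t1,,1337.abc", "cf1", 12);
        sink.publishReadFailure("t1,,1337.abc", "timeout");
    }
}
```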

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12519998/Canary-v0.java
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1367//console

        This message is automatically generated.

        stack added a comment -

        Please format your contrib as a patch (git or svn add then do a git diff --no-prefix or svn diff). Thanks.

        This line is not necessary any more:

         * Copyright 2012 The Apache Software Foundation
        

        Please fix this doc: ' * HBase Canary Tool, that that can be used to do' (too many 'that's)

        On the Sink interface, it's not going to be used by anyone else if it's private? That might be fine for a first checkin. Later, when there are other Sinks, we can open it up?

        I think FileSink is the wrong sink to do as the first implementation. Your first Sink should be a StdOutSink using the logging system. Notice how anything that is started with bin/hbase-daemon.sh gets log files set up for it (master, regionserver, but also rest, thrift, etc.). Doing this, your emissions will be in a well-known place, in files named with a format that matches other logging made by hbase, etc.

        This method is oddly named:

         public void publish(HRegionInfo region, HColumnDescriptor column, long msTime) {
        

        It seems like it's for logging messages like this: "%s read from region %s column family %s in %dms\n",

        ... should the method name be logReadTime? Or publishReadTiming?

        What's the BasicParser do? It matches what Tool does? We don't want GnuParser?

        I like this comment:

            // user has specified an interval for canary breaths
        

        That's cute.

        Put on one line:

            if (conf == null)
              conf = HBaseConfiguration.create();
        

        I think I should be able to run this once OR run it as a daemon. Pass an arg if it's to run as a daemon process?

        Can this code use any of the utility that is in hbck?

        I like the Tool improvements.

        Thanks Matteo.

        Matteo Bertozzi added a comment -

        Used LOG as the default Sink, renamed methods as suggested, and removed the BasicParser.

        @Stack
        What is your idea about integrating hbck? Some sort of automatic recovery in some conditions?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12521688/HBASE-4393-v0.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1430//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1430//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1430//console

        This message is automatically generated.

        stack added a comment -

        Committed to trunk. Thanks for the patch Matteo. I tried it out. Does the basics. Nice. Thanks.

        stack added a comment -

        Committed to 0.94 (thought you might like this Lars).

        Hudson added a comment -

        Integrated in HBase-0.94 #142 (See https://builds.apache.org/job/HBase-0.94/142/)
        HBASE-4393 Implement a canary monitoring program (Revision 1329575)

        Result = SUCCESS
        stack :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/tools
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/tools/Canary.java
        Hudson added a comment -

        Integrated in HBase-TRUNK #2805 (See https://builds.apache.org/job/HBase-TRUNK/2805/)
        HBASE-4393 Implement a canary monitoring program (Revision 1329574)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/tools
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/tools/Canary.java
        Ted Yu added a comment -

        This checkin might be related to:

        [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.8:check (default) on project hbase: Too many unapproved licenses: 1 -> [Help 1]
        
        stack added a comment -

        No. This patch has a license. The failure was because of the OOME. The RAT complaint is this:

        Unapproved licenses:

          hs_err_pid23951.log
        ...
        stack added a comment -

        @Ted FYI, if you go under the artifacts produced by the build, into the target dir, you can see the rat.txt now. That's where I got the above from. Thinking on it, I also went to change the build order so site goes first so we'll fail fast on a rat problem, but it seems the build is already this way – it must run unit tests up front anyway.

        Lars Francke added a comment -

        The Canary.java file has a wrong package definition.

        Lives in org.apache.hadoop.hbase.tools but says package org.apache.hadoop.hbase.tool;

        New issue?

        stack added a comment -

        Let me take care of it Lars. I should have seen that. Thanks for pointing it out. Fixed over in HBASE-5866.

        Hudson added a comment -

        Integrated in HBase-0.94-security #20 (See https://builds.apache.org/job/HBase-0.94-security/20/)
        HBASE-4393 Implement a canary monitoring program (Revision 1329575)

        Result = SUCCESS
        stack :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/tools
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/tools/Canary.java
        Lars Hofhansl added a comment -

        Thanks Stack... I do like this in 0.94 (minus the wrong package, of course).

        Hudson added a comment -

        Integrated in HBase-TRUNK-security #183 (See https://builds.apache.org/job/HBase-TRUNK-security/183/)
        HBASE-4393 Implement a canary monitoring program (Revision 1329574)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/tools
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/tools/Canary.java
        Jeremy Carroll added a comment -

        Just wanted to put in a few operational comments. We have a version of this Canary script hooked up to our current HBase cluster for monitoring. It works well to determine if your cluster is responding to RPCs in a healthy amount of time. But it does not work well to determine latency for requests overall, as the getStartKey row becomes cached. Since a request for the same key over and over again is basically cache warming, it returns in <1ms every time after a few iterations.

        We played around with the idea of using a random request within the RegionServer to get non-cached latency responses. In this scenario we are basically testing our disk latency. IMHO the intention of the Canary is not to test my disk response but the overall response / health of the HBase RegionServer. We took the approach of using the fsLatency histogram metrics (99th, 99.9th percentile) in a separate check, in addition to the Canary, for overall health status.
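        The cache-warming problem described above — always reading the same start key just measures the block cache after a few rounds — can be mitigated by varying the probe key each round. A small illustrative sketch (the key scheme is hypothetical, not what any HBase version does): extending the region start key with random bytes yields a key that still sorts at or after the start key, so each round reads a different row.

```java
import java.util.Random;

public class RandomProbeSketch {
    // Derive a per-round probe key by extending the region start key with
    // random lowercase letters; the extended key sorts at or after startKey.
    static String probeKey(String startKey, Random rnd) {
        StringBuilder sb = new StringBuilder(startKey);
        for (int i = 0; i < 4; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // Two rounds probe (almost certainly) different keys, so the second
        // read is not just a block-cache hit for the first.
        System.out.println(probeKey("row-", rnd) + " " + probeKey("row-", rnd));
    }
}
```

        Note the trade-off Jeremy points out: random keys mostly measure disk latency rather than regionserver health, which is why a separate check on latency histograms may be the better fit for that signal.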

        takeshi.miao added a comment -

        There are 4 differences compared with HBASE-4393:
        1. this tool will take any one region from each region server to monitor, not every region in the whole HBase cluster
        2. this tool is multi-threaded, so it will not be blocked if any region server hangs
        3. this tool takes one or more region server FQDNs as options, then monitors the given region servers
        3.1 it monitors all region servers if no option is given
        4. this tool can also take one or more regular expression patterns for region server FQDNs, for ease of use

        I use this tool in our internal HBase operations, so I think that other people may have the same requirements.

        Andrew Purtell added a comment -

        takeshi.miao Would you be interested in contributing your canary tool? Is it based on the old one? Please consider opening a new issue, this one is closed.

        takeshi.miao added a comment -

        Sorry, Andrew Purtell, I put this comment on the wrong ticket; moved it to HBASE-7525. Thanks for the reminder.


          People

          • Assignee: Matteo Bertozzi
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 12
