  1. HBase
  2. HBASE-15219

Canary tool does not return non-zero exit code when one of regions is in stuck state



    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.98.16
    • Fix Version/s: 1.3.0, 1.2.1, 0.98.18, 2.0.0
    • Component/s: canary
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      A new flag has been added to the Canary tool: -treatFailureAsError
      When this flag is specified, a read or write failure causes the Canary tool to exit with code 5.


      2016-02-05 12:24:18,571 ERROR [pool-2-thread-7] tool.Canary - read from region CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418. column family 0 failed
      org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
      Fri Feb 05 12:24:15 GMT 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@54c9fea0, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418. is not online on isthbase02-dnds1-3-crd.eng.sfdc.net,60020,1454669984738
      	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2852)
      	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4468)
      	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2984)
      	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31186)
      	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2149)
      	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
      	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
      	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
      	at java.lang.Thread.run(Thread.java:745)
      -bash-4.1$ echo $?

      The read() code below logs the error but does not set or return an exit code. As a result, the tool cannot be integrated with Nagios or other alerting systems.

      Ideally, it should return an error code for failures, as per the documentation:

      This tool will return non zero error codes to user for collaborating with other monitoring tools, such as Nagios. The error code definitions are:

      private static final int USAGE_EXIT_CODE = 1;
      private static final int INIT_ERROR_EXIT_CODE = 2;
      private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
      private static final int ERROR_EXIT_CODE = 4;
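
      To illustrate why a non-zero exit code matters for alerting, here is a minimal sketch of how a monitoring wrapper might translate the Canary exit codes listed above into Nagios plugin statuses. The class name, method name, and the choice of which code maps to which status are assumptions made for illustration; nothing here is part of HBase. Code 5 is the value described in the release note for -treatFailureAsError.

```java
/** Hypothetical wrapper-side helper; not part of HBase or the Canary tool. */
final class CanaryNagiosMapping {
    // Standard Nagios plugin return codes.
    static final int NAGIOS_OK = 0;
    static final int NAGIOS_WARNING = 1;
    static final int NAGIOS_CRITICAL = 2;
    static final int NAGIOS_UNKNOWN = 3;

    /**
     * Maps a Canary process exit code to a Nagios status.
     * The mapping itself is an assumption, not documented behaviour.
     */
    static int toNagiosStatus(int canaryExitCode) {
        switch (canaryExitCode) {
            case 0:
                return NAGIOS_OK;
            case 3: // TIMEOUT_ERROR_EXIT_CODE
            case 4: // ERROR_EXIT_CODE
            case 5: // read/write failure with -treatFailureAsError (per the release note)
                return NAGIOS_CRITICAL;
            case 1: // USAGE_EXIT_CODE
            case 2: // INIT_ERROR_EXIT_CODE
            default:
                // The probe itself could not run properly, so the cluster state is unknown.
                return NAGIOS_UNKNOWN;
        }
    }

    public static void main(String[] args) {
        // A Canary run that exits with 0 despite a failed read is reported as OK,
        // which is exactly the problem described in this issue.
        System.out.println(toNagiosStatus(0));
        System.out.println(toNagiosStatus(5));
    }
}
```

      With a mapping like this, a stuck region only raises an alert if the Canary process itself exits with a non-zero code.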


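      public Void read() {
        try {
          table = connection.getTable(region.getTable());
          tableDesc = table.getTableDescriptor();
        } catch (IOException e) {
          LOG.debug("sniffRegion failed", e);
          // The failure is published to the sink, but nothing records it for the exit code.
          sink.publishReadFailure(region, e);
          return null;
        }
        // (the remainder of the method is not shown in this report)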
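
      One way the behaviour in the release note could be realised is for the sink to count published failures and for the tool to consult that count when choosing its exit code. The sketch below is a simplified, self-contained illustration under that assumption. It is not the attached patch, and every class, method, and constant name in it is hypothetical; only the exit code value 5 and the flag's meaning come from the release note.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; not the HBase Canary source and not the attached patch. */
final class CanaryExitSketch {
    // The value 5 comes from the release note; the constant name is an assumption.
    static final int FAILURE_EXIT_CODE = 5;

    /** Minimal stand-in for a sink that remembers how many probes failed. */
    static final class CountingSink {
        private final AtomicLong readFailures = new AtomicLong();
        private final AtomicLong writeFailures = new AtomicLong();

        void publishReadFailure(String region, Exception e) {
            // Thread-safe increment, since probes run on a thread pool.
            readFailures.incrementAndGet();
        }

        void publishWriteFailure(String region, Exception e) {
            writeFailures.incrementAndGet();
        }

        long failureCount() {
            return readFailures.get() + writeFailures.get();
        }
    }

    /** Returns the failure exit code only when the flag is set and a failure was seen. */
    static int exitCodeFor(CountingSink sink, boolean treatFailureAsError) {
        if (treatFailureAsError && sink.failureCount() > 0) {
            return FAILURE_EXIT_CODE;
        }
        return 0;
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink();
        sink.publishReadFailure("example-region", new IOException("region not online"));
        System.out.println(exitCodeFor(sink, true));
        System.out.println(exitCodeFor(sink, false));
    }
}
```

      Keeping the decision behind a flag means existing callers that expect exit code 0 after a failed probe are unaffected unless they opt in.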


        1. HBASE-15219-branch-1.2.v8.patch
          10 kB
          Ted Yu
        2. HBASE-15219.v9.patch
          10 kB
          Ted Yu
        3. HBASE-15219.v8.patch
          10 kB
          Ted Yu
        4. HBASE-15219.v7.patch
          10 kB
          Ted Yu
        5. HBASE-15219.v5.patch
          9 kB
          Ted Yu
        6. HBASE-15219.v4.patch
          9 kB
          Ted Yu
        7. HBASE-15219.v3.patch
          8 kB
          Ted Yu
        8. HBASE-15219.v1.patch
          3 kB
          Ted Yu



            Assignee: Ted Yu (yuzhihong@gmail.com)
            Reporter: Vishal Khandelwal (vishk)