HBase / HBASE-12028

Abort the RegionServer when its handler threads die


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0, 1.1.0
    • Component/s: regionserver
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Adds a configuration property "hbase.regionserver.handler.abort.on.error.percent" for aborting the region server when some of its handler threads die. The default value is 0.5, which aborts the RS when half of its handler threads have died. A handler thread dies only in case of a serious software bug, or when a non-recoverable Error (StackOverflow, OOM, etc.) is thrown.
      These are the possible values for the configuration:
         * -1 => Disable aborting
         * 0 => Abort if even a single handler has died
         * 0.x => Abort only when this fraction of handlers have died
         * 1 => Abort only when all of the handlers have died
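      The four documented values can be read as a single threshold check. The following is a minimal sketch of that reading, not the code from the attached patches: the class `HandlerAbortPolicy` and the method `shouldAbort` are hypothetical names, and the choice of `>=` for the comparison is an assumption made so that 0.5 aborts at exactly half and 1 aborts once every handler is dead.

```java
/** Sketch of the threshold semantics described in the release note (hypothetical helper). */
final class HandlerAbortPolicy {
  /** Property name, taken from the release note. */
  static final String ABORT_PERCENT_KEY = "hbase.regionserver.handler.abort.on.error.percent";
  /** Default value, taken from the release note. */
  static final double DEFAULT_ABORT_PERCENT = 0.5;

  private HandlerAbortPolicy() {}

  /**
   * Decides whether the server should abort after a handler death.
   * Assumption: the threshold is met when failed >= abortPercent * total.
   */
  static boolean shouldAbort(double abortPercent, int failedHandlers, int totalHandlers) {
    if (abortPercent < 0) {
      return false; // -1 => aborting is disabled
    }
    if (failedHandlers <= 0) {
      return false; // nothing has died yet, so there is nothing to react to
    }
    // 0 => any single death aborts; 1 => only the death of every handler aborts.
    return failedHandlers >= abortPercent * totalHandlers;
  }

  public static void main(String[] args) {
    int total = 30; // illustrative handler count, not a value from the issue
    for (int failed : new int[] {1, 15, 30}) {
      System.out.printf("failed=%d/%d -> abort=%b%n",
          failed, total, shouldAbort(DEFAULT_ABORT_PERCENT, failed, total));
    }
  }
}
```

      Keeping the decision in a pure function makes the boundary cases (-1, 0 and 1) easy to check in isolation from any threading code.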

    Description

      Over in HBASE-11813, a user identified an issue in which all the RPC handler threads would exit with StackOverflow errors due to an unchecked recursion-terminating condition. Our clusters demonstrated the same trace. While the patch posted for HBASE-11813 got our clusters to be merry again, the breakdown surfaced some larger issues.

      When the RegionServer had all its RPC handler threads dead, it continued to have regions assigned to it. Clearly, it wouldn't be able to serve reads and writes on those regions. A second issue was that when a user tried to disable or drop a table, the master would try to communicate with the regionserver for region unassignment. Since the same handler threads seem to be used for master <-> RS communication as well, the master ended up hanging on the RS indefinitely. Eventually, the master stopped responding to all table meta-operations.

      A handler thread should never exit, and if it does, it seems like the more prudent thing to do would be for the RS to abort. This way, at least recovery can be undertaken and the regions could be reassigned elsewhere. I also think that the master<->RS communication should get its own exclusive threadpool, but I'll wait until this issue has been sufficiently discussed before opening an issue ticket for that.
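      To make the proposal concrete, here is a rough sketch of handler threads that report their own death so the server can abort instead of lingering with no one to serve RPCs. It is an illustration under assumptions, not the HBase implementation: `AbortingHandlerPool`, `newHandler`, `onHandlerDeath` and the `abortAction` callback (a stand-in for whatever actually aborts the RS) are all hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical pool whose handler threads trigger an abort callback when too many die. */
final class AbortingHandlerPool {
  private final int handlerCount;
  private final double abortPercent;          // negative disables aborting
  private final Consumer<String> abortAction; // stand-in for the real server abort
  private final AtomicInteger failed = new AtomicInteger();
  private final AtomicBoolean aborted = new AtomicBoolean();

  AbortingHandlerPool(int handlerCount, double abortPercent, Consumer<String> abortAction) {
    this.handlerCount = handlerCount;
    this.abortPercent = abortPercent;
    this.abortAction = abortAction;
  }

  /** Wraps a handler loop so that an Error escaping it is counted before the thread exits. */
  Thread newHandler(String name, Runnable loop) {
    return new Thread(() -> {
      try {
        loop.run();
      } catch (Error e) { // e.g. StackOverflowError or OutOfMemoryError
        onHandlerDeath(name, e);
      }
    }, name);
  }

  /** Records one dead handler and fires the abort callback at most once. */
  void onHandlerDeath(String name, Throwable cause) {
    int dead = failed.incrementAndGet();
    boolean overThreshold = abortPercent >= 0 && dead >= abortPercent * handlerCount;
    if (overThreshold && aborted.compareAndSet(false, true)) {
      abortAction.accept(dead + "/" + handlerCount + " handlers dead, last was "
          + name + ": " + cause);
    }
  }

  int failedHandlers() {
    return failed.get();
  }

  public static void main(String[] args) throws InterruptedException {
    AbortingHandlerPool pool =
        new AbortingHandlerPool(2, 0.5, msg -> System.out.println("ABORT: " + msg));
    // Simulate a handler that dies with a non-recoverable Error.
    Thread t = pool.newHandler("handler-0", () -> {
      throw new StackOverflowError("simulated");
    });
    t.start();
    t.join();
    System.out.println("failed handlers: " + pool.failedHandlers());
  }
}
```

      Guarding the callback with a compare-and-set keeps the abort from being requested repeatedly when several handlers die in quick succession, which is what the stack-overflow scenario above would produce.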

      Attachments

        1. Hbase-12028.patch
          30 kB
          Alicia Ying Shu
        2. Hbase-12028-v3.patch
          31 kB
          Alicia Ying Shu
        3. hbase-12028-v4.patch
          32 kB
          Alicia Ying Shu
        4. hbase-12028-v5.patch
          32 kB
          Alicia Ying Shu
        5. hbase-12028-v5-master.patch
          30 kB
          Enis Soztutar
        6. hbase-12028-v5-branch-1.patch
          30 kB
          Enis Soztutar

            People

              Assignee: Alicia Ying Shu (aliciashu)
              Reporter: Sudarshan Kadambi (skadambi)
              Votes: 0
              Watchers: 13
