[HBASE-12028] Abort the RegionServer when its handler threads die - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0, 1.1.0
Component/s: regionserver
Labels: None
Hadoop Flags: Reviewed
Release Note:

Adds a configuration property "hbase.regionserver.handler.abort.on.error.percent" for aborting the region server when some of its handler threads die. The default value is 0.5, which aborts the RS when half of its handler threads have died. A handler thread dies only because of a serious software bug or when a non-recoverable Error (StackOverflowError, OutOfMemoryError, etc.) is thrown.
These are the possible values for the configuration:
   * -1 => Disable aborting
   * 0 => Abort if even a single handler has died
   * 0.x => Abort only when this fraction of the handlers has died
   * 1 => Abort only when all of the handlers have died
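
As a rough illustration of the value semantics listed above, the following Java sketch maps a threshold and a count of dead handlers to an abort decision. The class and method names are hypothetical and are not taken from the patch. The boundary handling is also an assumption: this sketch aborts once at least ceil(threshold x total) handlers, and at least one, have died, which may differ from the comparison the actual code performs.

```java
// Hypothetical sketch of the threshold semantics; not the code from the patch.
final class HandlerAbortPolicy {

    /**
     * Decides whether the region server should abort.
     *
     * @param threshold      value of hbase.regionserver.handler.abort.on.error.percent
     * @param failedHandlers number of handler threads that have died so far
     * @param totalHandlers  number of handler threads that were started
     */
    static boolean shouldAbort(double threshold, int failedHandlers, int totalHandlers) {
        if (threshold < 0) {
            return false; // -1 => aborting is disabled
        }
        if (failedHandlers <= 0) {
            return false; // nothing has died yet
        }
        // Assumed rounding: 0 => a single dead handler is enough,
        // 0.5 => half of the handlers, 1 => all of the handlers.
        int needed = Math.max(1, (int) Math.ceil(threshold * totalHandlers));
        return failedHandlers >= needed;
    }

    public static void main(String[] args) {
        int total = 10;
        double[] thresholds = {-1, 0, 0.5, 1};
        for (double t : thresholds) {
            for (int failed : new int[] {1, 5, 10}) {
                System.out.println("threshold=" + t + " failed=" + failed + "/" + total
                        + " abort=" + shouldAbort(t, failed, total));
            }
        }
    }
}
```

Under these assumptions, the default of 0.5 with 10 handlers leaves the server running after four handler deaths and aborts on the fifth.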


Description

Over in HBASE-11813, a user identified an issue wherein all the RPC handler threads would exit with StackOverflow errors due to an unchecked recursion-terminating condition. Our clusters demonstrated the same trace. While the patch posted for HBASE-11813 got our clusters to be merry again, the breakdown surfaced some larger issues.

When the RegionServer had all its RPC handler threads dead, it continued to have regions assigned to it. Clearly, it wouldn't be able to serve reads and writes on those regions. A second issue was that when a user tried to disable or drop a table, the master would try to communicate with the regionserver for region unassignment. Since the same handler threads seem to be used for master <-> RS communication as well, the master ended up hanging on the RS indefinitely. Eventually, the master stopped responding to all table meta-operations.

A handler thread should never exit, and if it does, the more prudent thing to do would be for the RS to abort. This way, at least recovery can be undertaken and the regions can be reassigned elsewhere. I also think that the master <-> RS communication should get its own exclusive threadpool, but I'll wait until this issue has been sufficiently discussed before opening a ticket for that.
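
The proposal above (abort once handler threads have exited) could be sketched roughly as follows. This is only an illustration of the idea, not the implementation from the attached patches: `HandlerDeathMonitor`, `AbortCallback`, and `abortAfterFailures` are hypothetical names, and the sketch uses a plain failure count instead of the configured percentage to stay short.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: count handler threads that die and trigger an abort callback.
final class HandlerDeathMonitor {

    /** Stand-in for whatever the server uses to abort itself (an assumption). */
    interface AbortCallback {
        void abort(String why, Throwable cause);
    }

    private final AtomicInteger failedHandlers = new AtomicInteger();
    private final int abortAfterFailures;
    private final AbortCallback callback;

    HandlerDeathMonitor(int abortAfterFailures, AbortCallback callback) {
        if (abortAfterFailures < 1) {
            throw new IllegalArgumentException("abortAfterFailures must be >= 1");
        }
        this.abortAfterFailures = abortAfterFailures;
        this.callback = callback;
    }

    /** Wraps a handler loop so that an escaping Throwable is counted before the thread exits. */
    Runnable wrap(Runnable handlerLoop) {
        return () -> {
            try {
                handlerLoop.run();
            } catch (Throwable t) {
                int failedNow = failedHandlers.incrementAndGet();
                // Fire the callback exactly once, when the limit is first reached.
                if (failedNow == abortAfterFailures) {
                    callback.abort(failedNow + " handler thread(s) died", t);
                }
                // The handler thread ends here; it is not restarted in this sketch.
            }
        };
    }

    int failedCount() {
        return failedHandlers.get();
    }

    public static void main(String[] args) throws InterruptedException {
        HandlerDeathMonitor monitor = new HandlerDeathMonitor(2,
                (why, cause) -> System.out.println("ABORT: " + why + " (" + cause + ")"));
        Thread[] handlers = new Thread[4];
        for (int i = 0; i < handlers.length; i++) {
            // Each simulated handler dies immediately with a non-recoverable Error.
            handlers[i] = new Thread(
                    monitor.wrap(() -> { throw new StackOverflowError("simulated"); }),
                    "handler-" + i);
            handlers[i].start();
        }
        for (Thread t : handlers) {
            t.join();
        }
        System.out.println("dead handlers: " + monitor.failedCount());
    }
}
```

Aborting hands the regions back for reassignment, rather than leaving a server that still holds regions but can no longer answer client or master RPCs.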

Attachments

Hbase-12028.patch (30 kB), 07/Dec/14 23:09, Alicia Ying Shu
Hbase-12028-v3.patch (31 kB), 10/Dec/14 02:05, Alicia Ying Shu
hbase-12028-v4.patch (32 kB), 17/Dec/14 20:26, Alicia Ying Shu
hbase-12028-v5.patch (32 kB), 29/Dec/14 23:19, Alicia Ying Shu
hbase-12028-v5-branch-1.patch (30 kB), 02/Jan/15 23:31, Enis Soztutar
hbase-12028-v5-master.patch (30 kB), 02/Jan/15 23:31, Enis Soztutar

Issue Links

causes
HBASE-25198 Remove RpcSchedulerFactory#create(Configuration, PriorityFunction) (Resolved)

is related to
HBASE-11813 CellScanner#advance may overflow stack (Closed)
HBASE-12788 Promote Abortable to LimitedPrivate (Closed)

relates to
HBASE-12200 When an RPC server handler thread dies, throw exception (Closed)
HBASE-12787 Backport HBASE-12028 (Abort the RegionServer when it's handler threads die) to 0.98 (Closed)

People

Assignee: Alicia Ying Shu
Reporter: Sudarshan Kadambi
Votes: 0
Watchers: 13

Dates

Created: 19/Sep/14 15:34
Updated: 18/Oct/20 20:04
Resolved: 02/Jan/15 23:23