[HBASE-20137] TestRSGroups is flakey - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 2.0.0-beta-2
Fix Version/s: None
Component/s: test
Labels: None

Description

It was the single test that failed the hbase-2 nightlies in #440 at the hadoop2 stage.

The failure manifests as a timeout. The cause is interesting and calls into question some of the clauses in UnassignProcedure#remoteCallFailed.

We are running a DisableTableProcedure concurrently with a shutdown. pid=309 is the disable. pid=311 is the interesting one. The log below is a little hard to read (the exception 'message' is the current procedure as a String, which is hard to parse; fixing), but we are trying to unassign as part of the disable. Our RPC fails because the server we are trying to RPC to is currently being processed as crashed (pid=308 is a ServerCrashProcedure for this server). As part of processing the failed RPC we expire the server: if we can't RPC to it, it must be gone. The current procedure is then suspended until it gets woken up by the ServerCrashProcedure triggered by the expire. Only in this case we are shutting down, so the expire is ignored. The current procedure is left in its suspended state. This prevents the Master from going down, so we time out.

2018-03-05 11:29:22,507 INFO [PEWorker-13] assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN [PEWorker-13] assignment.RegionTransitionProcedure(187): Remote call failed pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN [PEWorker-13] assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524, exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException: pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN [PEWorker-13] master.ServerManager(580): Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in progress

I need to cater for the case where the server expiration is rejected.
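A minimal sketch of the idea, not the actual patch: the caller only suspends when the expiration was accepted, since otherwise nothing will ever wake it. Every name here (ExpireRejectedSketch, ServerManagerModel, Outcome, onRemoteCallFailed, and the boolean return from expireServer) is a hypothetical stand-in for illustration and is not the HBase API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model only; none of these names are the HBase API.
class ExpireRejectedSketch {

    /** Stand-in for the master-side component that accepts or rejects a server expiration. */
    static final class ServerManagerModel {
        private final AtomicBoolean shutdownInProgress = new AtomicBoolean(false);

        void beginShutdown() {
            shutdownInProgress.set(true);
        }

        /**
         * Returns true only when the expiration is accepted, meaning a crash
         * handler will run later and wake any suspended waiters. Assumption:
         * the expiration is rejected once shutdown has started.
         */
        boolean expireServer(String serverName) {
            return !shutdownInProgress.get();
        }
    }

    enum Outcome { SUSPEND_UNTIL_WOKEN, DO_NOT_SUSPEND }

    /** Decide what the failed-RPC handler should do, based on whether the expire was accepted. */
    static Outcome onRemoteCallFailed(ServerManagerModel manager, String serverName) {
        boolean accepted = manager.expireServer(serverName);
        // Suspending after a rejected expire is the hang described above:
        // no crash handler will be scheduled to wake the procedure.
        return accepted ? Outcome.SUSPEND_UNTIL_WOKEN : Outcome.DO_NOT_SUSPEND;
    }

    public static void main(String[] args) {
        ServerManagerModel manager = new ServerManagerModel();
        String server = "1cfd208ff882,40584,1520249102524";
        System.out.println("running:       " + onRemoteCallFailed(manager, server));
        manager.beginShutdown();
        System.out.println("shutting down: " + onRemoteCallFailed(manager, server));
    }
}
```

What the procedure should do in the DO_NOT_SUSPEND branch (fail, retry, or simply finish so the Master can stop) is left open here; the sketch only shows that the rejection has to be visible to the caller.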

Attachments

HBASE-20137.branch-2.001.patch
06/Mar/18 05:27
10 kB
Michael Stack
HBASE-20137.branch-2.002.patch
06/Mar/18 06:02
13 kB
Michael Stack
HBASE-20137.branch-2.003.patch
06/Mar/18 06:53
13 kB
Michael Stack
HBASE-20137.branch-2.003.patch
06/Mar/18 15:42
13 kB
Michael Stack

Issue Links

relates to

HBASE-20152 [AMv2] DisableTableProcedure versus ServerCrashProcedure

Resolved

links to

Review Board (branch-2)

Sub-Tasks

Fix checkstyle introduced in parent 'TestRSGroups is flakey' in new test additions

Resolved

Michael Stack

People

Assignee: Michael Stack

Reporter: Michael Stack

Votes: 0

Watchers: 3

Dates

Created: 06/Mar/18 04:42

Updated: 07/May/19 16:08

Resolved: 22/Mar/18 02:36