[HBASE-19515] Region server left in online servers list forever if it went down after registering to master and before creating ephemeral node - ASF JIRA

Details

Type: Bug
Status: Reopened
Priority: Critical
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Region Assignment
Labels: None

Description

This one is interesting. It was supposedly fixed a long time ago in HBASE-9593 (that issue has the same subject as this one), but a problem with the fix was reported later, post-commit, long after the issue was closed. The 'fix' was to register the ephemeral node in ZK BEFORE reporting in to the Master for the first time. The problem with this approach is that the Master tells the RS what name it should use when reporting in. If we register in ZK before we talk to the Master, the name in ZK and the one the RS ends up using could deviate.

In hbase2, we do the right thing and register the ephemeral node after we report to the Master. So the issue reported in HBASE-9593 is back: an RS that dies between reporting to the Master and registering in ZK stays registered at the Master forever, and we'll keep trying to assign it regions. It's a real problem.
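To make the window concrete, here is a toy model of the ordering described above. It is not HBase code: `ToyMaster`, `StartupRaceDemo`, and all of their methods are hypothetical names invented for this sketch, and two in-memory sets stand in for the Master's online-servers list and the RS ephemeral nodes in ZK. The only assumption about behavior is the one stated in this description: the Master drops a server when its ephemeral node goes away, so a server that never created one is never dropped.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy stand-in for the Master's bookkeeping; hypothetical, not HBase code. */
class ToyMaster {
    /** Servers the Master considers online and will assign regions to. */
    final Set<String> onlineServers = new HashSet<>();
    /** Stand-in for the RS ephemeral nodes in ZK. */
    final Set<String> zkEphemeralNodes = new HashSet<>();

    /** Step 1: the RS reports in; the Master picks the name and onlines the server. */
    String reportForDuty(String hostname, int port, long startCode) {
        String serverName = hostname + "," + port + "," + startCode;
        onlineServers.add(serverName);
        return serverName;
    }

    /** Step 2: the RS registers its ephemeral node under the name the Master gave it. */
    void createEphemeralNode(String serverName) {
        zkEphemeralNodes.add(serverName);
    }

    /** Node deletion is the only trigger that removes a server in this model. */
    void onEphemeralNodeDeleted(String serverName) {
        if (zkEphemeralNodes.remove(serverName)) {
            onlineServers.remove(serverName);
        }
    }
}

class StartupRaceDemo {
    /** Servers the Master thinks are online but that never made it into ZK. */
    static Set<String> strandedServers(ToyMaster master) {
        Set<String> stranded = new HashSet<>(master.onlineServers);
        stranded.removeAll(master.zkEphemeralNodes);
        return stranded;
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster();

        // A healthy server completes both steps.
        String healthy = master.reportForDuty("rs1.example.org", 16020, 1L);
        master.createEphemeralNode(healthy);

        // A second server reports in, then dies before step 2.
        String doomed = master.reportForDuty("rs2.example.org", 16020, 2L);

        // There is no node to delete, so nothing ever removes the dead server.
        master.onEphemeralNodeDeleted(doomed);

        System.out.println("Stranded: " + strandedServers(master));
    }
}
```

In this model the dead server stays in `onlineServers` indefinitely, which is the state the issue title describes.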

That hbase2 has this issue has been masked up until now. The test that was written for HBASE-9593, TestRSKilledWhenInitializing, is a good test but a little sloppy. It puts up two RSs and aborts one of them only after it has registered at the Master but before it posts to ZK. That leaves one healthy server up, and it is hosting hbase:meta. This is enough for the test to bluster through: the only assign it does is of the namespace table, which goes to the hbase:meta server. If the test created a new table and did round robin, it'd fail.

After HBASE-18946, where we do round robin on table create (a desirable attribute) via the balancer so all is kosher, TestRSKilledWhenInitializing now starts to fail because we choose the hobbled server most of the time.
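A rough illustration of why round robin exposes the problem, under loud assumptions: `RoundRobinSketch` is a hypothetical helper that does plain modulo round robin over the list of servers the Master believes are online. The real balancer is more involved; the sketch only shows that once the dead server is in that list, it gets a share of every new table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical modulo round robin; a stand-in for the balancer, not its real logic. */
class RoundRobinSketch {
    /** Spreads regions over servers in order and returns server -> regions. */
    static Map<String, List<String>> assign(List<String> regions, List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no servers to assign to");
        }
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String server : servers) {
            plan.put(server, new ArrayList<>());
        }
        for (int i = 0; i < regions.size(); i++) {
            // Region i goes to server (i mod number of servers).
            String target = servers.get(i % servers.size());
            plan.get(target).add(regions.get(i));
        }
        return plan;
    }
}
```

With a server list of one healthy and one hobbled RS, every second region of a new table lands on the hobbled one, and with no timeout on assign those regions never come online.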

So, this issue is about fixing the original issue properly for hbase2. We don't have a timeout on assign in AMv2, not yet; that might be the fix. Or perhaps a double report before we online a server, with the second report coming in after the ephemeral node goes up in ZK (or we stop doing ephemeral nodes for RSs in ZK and just rely on heartbeats...).
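One way the double-report idea could look, as a sketch only. `TwoPhaseServerTracker` and its methods are hypothetical and do not correspond to any existing HBase class; the `znodeExists` flag stands in for whatever check the Master would do against ZK. The first report only hands back the name and parks the server as pending; the server becomes an assignment candidate only on the second report, made after its ephemeral node is up.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical two-phase registration; a sketch of the "double report" idea. */
class TwoPhaseServerTracker {
    /** Servers that have reported once but are not yet confirmed in ZK. */
    private final Set<String> pending = new LinkedHashSet<>();
    /** Servers confirmed by a second report; only these get regions. */
    private final Set<String> online = new LinkedHashSet<>();

    /** First report: the Master chooses the name but does not online the server. */
    String firstReport(String hostname, int port, long startCode) {
        String serverName = hostname + "," + port + "," + startCode;
        pending.add(serverName);
        return serverName;
    }

    /**
     * Second report, sent after the RS has created its ephemeral node.
     * Returns true only if the server was pending and its node exists.
     */
    boolean secondReport(String serverName, boolean znodeExists) {
        if (!znodeExists || !pending.remove(serverName)) {
            return false;
        }
        online.add(serverName);
        return true;
    }

    /** Only fully registered servers are offered to assignment. */
    List<String> assignmentCandidates() {
        return new ArrayList<>(online);
    }
}
```

A server that dies between the two reports stays in `pending` and is never handed regions. A real implementation would still need to age out stale pending entries, which this sketch leaves open.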

Making this a critical issue.

Issue Links

duplicates: HBASE-15002 TestRSKilledWhenInitializing is flakey (Closed)

People

Assignee: Unassigned

Reporter: Michael Stack

Votes: 0

Watchers: 7

Dates

Created: 14/Dec/17 19:13

Updated: 02/Jun/21 14:32