[HBASE-9726] Proposal for a new master design for assignment - ASF JIRA

Details

Type: Brainstorming
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: master
Labels: None

Release Note:

The current assignment process (and the split process) relies on ZK for communication between the master and the regionservers. This pattern has two drawbacks:
  1. For a cluster with a large number of regions (say, 10K-100K regions), ZK becomes the bottleneck for cluster restart: the assignment/split status/progress is stored in ZK, and ZK's write throughput is limited.
  2. Since a ZK watch is one-time and event notification/processing is asynchronous, there is no guarantee that the master (the watcher) is notified of the up-to-date status/progress in time. The master therefore relies on idempotence for its correctness, which makes the logic/code very hard to understand and maintain.
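
The second drawback can be illustrated with a toy model. This is only a sketch: `ToyZNode` is a hypothetical in-memory stand-in, not the ZooKeeper client API, and the state names are illustrative. It shows how a one-shot watch combined with delayed event delivery lets the watcher miss intermediate states.

```python
from typing import Callable, List, Optional


class ToyZNode:
    """Hypothetical stand-in for a znode with one-shot watches (not the ZK API)."""

    def __init__(self, data: str) -> None:
        self.data = data
        self._watcher: Optional[Callable[[], None]] = None
        self._pending: List[Callable[[], None]] = []  # events queued, not yet delivered

    def get_and_watch(self, watcher: Callable[[], None]) -> str:
        """Read the current data and register a one-shot watch."""
        self._watcher = watcher
        return self.data

    def set(self, data: str) -> None:
        """Write new data; the first write after registration consumes the watch."""
        self.data = data
        if self._watcher is not None:
            self._pending.append(self._watcher)
            self._watcher = None  # one-shot: later writes trigger nothing

    def deliver_events(self) -> None:
        """Deliver queued notifications, modelling asynchronous delivery."""
        pending, self._pending = self._pending, []
        for watcher in pending:
            watcher()


def observe_region_states(updates: List[str]) -> List[str]:
    """Return the states the watcher (the master) actually observes."""
    node = ToyZNode("OFFLINE")
    seen: List[str] = []

    def on_change() -> None:
        # On notification, re-read the data and re-register the watch.
        seen.append(node.get_and_watch(on_change))

    seen.append(node.get_and_watch(on_change))
    for state in updates:
        node.set(state)  # the writer outpaces event delivery
    node.deliver_events()
    return seen


if __name__ == "__main__":
    print(observe_region_states(["PENDING_OPEN", "OPENING", "OPENED"]))
```

In this model the watcher only sees the initial and the final state, so any logic on the master side has to tolerate skipped transitions.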

A new assignment design is proposed as below:
  1. Assignment/split status/progress is stored in a system table (say 'assignTable'), similar to the meta table, rather than in ZK. This improves write throughput and hence the performance of restart for clusters with a large number of regions.
  2. The communication pattern for assignment/split is changed this way: the master talks directly with the regionserver (the master issues an assign request to the regionserver, and the regionserver reports the assign progress back to the master) and records the status/progress of each assignment/split in the 'assignTable'. In case of master failure, the new active master reads the 'assignTable' to rebuild its knowledge of the ongoing assignment/split tasks and continues from there. (The regionserver doesn't write to the 'assignTable'.)
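
A minimal sketch of the proposed flow, under stated assumptions: `AssignTable`, `RegionServer`, and `Master` are hypothetical in-memory classes, not HBase APIs, and the two states used are illustrative since the detailed state machine has not been worked out yet. Only the master writes to the table, and a new master rebuilds its in-flight tasks by scanning it.

```python
from typing import Dict


class AssignTable:
    """Hypothetical in-memory stand-in for the 'assignTable' system table."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}

    def put(self, region: str, state: str, server: str) -> None:
        self.rows[region] = {"state": state, "server": server}

    def scan(self) -> Dict[str, Dict[str, str]]:
        return {region: dict(row) for region, row in self.rows.items()}


class RegionServer:
    """Hypothetical regionserver: answers the master directly, never writes the table."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.open_regions = set()

    def open_region(self, region: str) -> str:
        self.open_regions.add(region)  # re-opening an open region is a no-op here
        return "OPENED"


class Master:
    def __init__(self, table: AssignTable, servers: Dict[str, RegionServer]) -> None:
        self.table = table
        self.servers = servers
        # On startup (including failover), rebuild in-flight tasks from the table.
        self.in_flight = {
            region: row for region, row in table.scan().items()
            if row["state"] != "OPENED"
        }

    def begin_assign(self, region: str, server: str) -> None:
        """Record the intent before contacting the regionserver."""
        self.table.put(region, "PENDING_OPEN", server)
        self.in_flight[region] = {"state": "PENDING_OPEN", "server": server}

    def complete_assign(self, region: str) -> None:
        """Send the assign request and record the reported result."""
        server = self.in_flight[region]["server"]
        state = self.servers[server].open_region(region)
        self.table.put(region, state, server)
        del self.in_flight[region]

    def resume(self) -> None:
        """Continue every task that was in flight when the old master died."""
        for region in sorted(self.in_flight):
            self.complete_assign(region)


if __name__ == "__main__":
    table = AssignTable()
    servers = {"rs1": RegionServer("rs1")}
    old_master = Master(table, servers)
    old_master.begin_assign("r1", "rs1")
    old_master.begin_assign("r2", "rs1")
    old_master.complete_assign("r1")
    # The old master fails here; a new active master takes over.
    new_master = Master(table, servers)
    new_master.resume()
    print(table.scan())
```

In this sketch the new master simply re-sends the assign request for anything not yet marked as opened, which assumes the regionserver handles a repeated request safely.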

I am raising this initial proposal for discussion in this JIRA. We can work out and present the detailed state machine if there is no objection.

Issue Links

is related to

HBASE-5487 Generic framework for Master-coordinated tasks

Closed

People

Assignee: Unassigned

Reporter: Honghua Feng

Votes: 0

Watchers: 8

Dates

Created: 08/Oct/13 12:17

Updated: 16/Jun/22 18:16

Resolved: 10/Oct/13 12:31