HBase / HBASE-9726

Proposal for a new master design for assignment

Details

    • Type: Brainstorming
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: master
    • Labels: None
      The current assignment process (and the split process) relies on ZK for communication between the master and the regionservers. This pattern has two drawbacks:
        1. For clusters with a large number of regions (say, 10K-100K), ZK becomes the bottleneck for cluster restart: assignment/split status and progress are stored in ZK, and ZK's write throughput is limited.
        2. Since a ZK watch is one-time and event notification/processing is asynchronous, there is no guarantee that the master (the watcher) is notified of the up-to-date status/progress in time. The master therefore relies on idempotence for its correctness, which makes the logic/code very hard to understand and maintain.
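The second drawback can be illustrated with a small, self-contained simulation. This is only a sketch: `Node`, `getDataAndWatch`, and the state names `OFFLINE`/`OPENING`/`OPENED` are hypothetical stand-ins, not the ZooKeeper or HBase API. It assumes a one-shot watch that must be re-armed by the watcher and events that are handled after the writes have already happened.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class OneShotWatchDemo {
    /** Toy stand-in for a znode: one value plus at most one armed one-shot watch. */
    static final class Node {
        private String data;
        private boolean watchArmed;
        private final Queue<String> events; // notifications waiting to be handled

        Node(String initial, Queue<String> events) {
            this.data = initial;
            this.events = events;
        }

        /** Reads the current value and (re-)arms the one-shot watch. */
        String getDataAndWatch() {
            watchArmed = true;
            return data;
        }

        /** Writes a value; queues an event only if the watch is armed, then disarms it. */
        void setData(String value) {
            data = value;
            if (watchArmed) {
                watchArmed = false;
                events.add("NodeDataChanged");
            }
        }
    }

    /** Returns the states the watcher observes when two writes land before it handles the first event. */
    static List<String> run() {
        Queue<String> events = new ArrayDeque<>();
        Node node = new Node("OFFLINE", events);
        List<String> seen = new ArrayList<>();

        // Watcher (master) reads the initial state and arms the watch.
        seen.add(node.getDataAndWatch());

        // Writer (regionserver) makes two quick transitions; only the first fires the watch.
        node.setData("OPENING");
        node.setData("OPENED");

        // Watcher handles queued events later: it reads whatever is current and re-arms.
        while (!events.isEmpty()) {
            events.poll();
            seen.add(node.getDataAndWatch());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the watcher never observes the intermediate `OPENING` state, so any handler logic has to tolerate skipped transitions.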

      The new assignment design is proposed as follows:
        1. Assignment/split status and progress are stored in a system table (say 'assignTable'), like the meta table, rather than in ZK. This improves write throughput and hence restart performance for clusters with a large number of regions.
        2. The communication pattern for assignment/split changes as follows: the master talks directly to the regionserver (the master issues an assign request to the regionserver, and the regionserver reports the assign progress back to the master) and records the status/progress of each assignment/split in the 'assignTable'. In case of master failure, the new active master reads the 'assignTable' to rebuild its knowledge of the ongoing assignment/split tasks and continues from there. (The regionserver doesn't write to the 'assignTable'.)

      This JIRA raises the initial proposal for discussion. If there are no objections, we can work out and present the detailed state machine.
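To make the proposed flow concrete, here is a minimal in-memory sketch. It is not the detailed state machine and not HBase code: `AssignTable`, `RegionServer`, `Master`, the two-value `State` enum, and all method names are hypothetical. It assumes that only the master writes the table and that the regionserver reports progress purely through its reply to the master.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AssignTableSketch {
    enum State { PENDING_OPEN, OPEN }

    /** Hypothetical stand-in for the proposed 'assignTable': region name -> last recorded state. */
    static final class AssignTable {
        private final Map<String, State> rows = new LinkedHashMap<>();

        void put(String region, State state) { rows.put(region, state); }

        Map<String, State> scan() { return new LinkedHashMap<>(rows); }
    }

    /** Hypothetical regionserver: replies to the master and never writes the table. */
    static final class RegionServer {
        State open(String region) { return State.OPEN; }
    }

    static final class Master {
        private final AssignTable table;
        private final Map<String, State> inMemory = new LinkedHashMap<>();

        Master(AssignTable table) { this.table = table; }

        /** Records the intent first, so a successor master can see the in-flight assignment. */
        void startAssign(String region) { record(region, State.PENDING_OPEN); }

        /** Sends the assign request and records the regionserver's reply. */
        void finishAssign(String region, RegionServer rs) { record(region, rs.open(region)); }

        /** A new active master rebuilds its view from the table and returns regions still in transition. */
        List<String> recover() {
            inMemory.clear();
            inMemory.putAll(table.scan());
            List<String> pending = new ArrayList<>();
            for (Map.Entry<String, State> e : inMemory.entrySet()) {
                if (e.getValue() != State.OPEN) {
                    pending.add(e.getKey());
                }
            }
            return pending;
        }

        private void record(String region, State state) {
            table.put(region, state);
            inMemory.put(region, state);
        }
    }

    /** Simulates a master failure mid-assignment; returns the regions the new master had to resume. */
    static List<String> demo() {
        AssignTable table = new AssignTable();
        RegionServer rs = new RegionServer();

        Master first = new Master(table);
        first.startAssign("r1");
        first.finishAssign("r1", rs);
        first.startAssign("r2"); // the first master fails before r2 completes

        Master second = new Master(table);
        List<String> pending = second.recover();
        for (String region : pending) {
            second.finishAssign(region, rs);
        }
        return pending;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the intent is recorded before the request goes out, the successor master in this sketch finds `r2` still in transition and resumes it without any ZK involvement.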

          People

            Assignee: Unassigned
            Reporter: Honghua Feng (fenghh)
            Votes: 0
            Watchers: 8
