HBase

HBASE-2692: Master rewrite and cleanup for 0.90

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      This is the parent issue for master changes targeted at 0.90 release.

      Changes done as part of this issue grew out of work done over in HBASE-2485 to move region transitions into ZK.

      In addition to that work, this issue will include general HMaster and ZooKeeper refactorings and cleanups.

      1. book.pdf
        36 kB
        stack
      2. 2692.patch
        1.48 MB
        stack
      3. BRANCH_NOTES.txt
        11 kB
        Jonathan Gray

        Issue Links

          Activity

          Jonathan Gray added a comment -

          Created a feature branch for work related to this JIRA.

          https://svn.apache.org/repos/asf/hbase/branches/0.90_master_rewrite

          As part of that branch, I added a BRANCH_CHANGES.txt for changes just in the branch. On commit messages, I am prefixing with [0.90_master_rewrite].

          Jonathan Gray added a comment -

          This is a text file containing information about all types of region transitions including how to deal with failures of both Masters and RegionServers.

          Also contains the set of valid unassigned ZK node transitions. No more heavy-handedly doing createOrUpdate everywhere.
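
          A minimal sketch of what validating those transitions could look like, instead of a blind createOrUpdate. The state names (OFFLINE, OPENING, OPENED, CLOSING, CLOSED) and the allowed edges are illustrative assumptions, not necessarily those in BRANCH_NOTES.txt:

          // Sketch only: state names and allowed transitions are assumptions for
          // illustration and may not match BRANCH_NOTES.txt.
          import java.util.EnumMap;
          import java.util.EnumSet;
          import java.util.Map;

          public enum UnassignedState {
            OFFLINE, OPENING, OPENED, CLOSING, CLOSED;

            // Allowed next states from each state; anything else is rejected
            // rather than blindly written to ZK.
            private static final Map<UnassignedState, EnumSet<UnassignedState>> VALID =
                new EnumMap<>(UnassignedState.class);
            static {
              VALID.put(OFFLINE, EnumSet.of(OPENING));
              VALID.put(OPENING, EnumSet.of(OPENED, OFFLINE)); // back to OFFLINE on open failure
              VALID.put(OPENED, EnumSet.of(CLOSING));
              VALID.put(CLOSING, EnumSet.of(CLOSED));
              VALID.put(CLOSED, EnumSet.of(OFFLINE)); // master re-assigns
            }

            public boolean canTransitionTo(UnassignedState next) {
              return VALID.getOrDefault(this, EnumSet.noneOf(UnassignedState.class)).contains(next);
            }
          }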

          Jonathan Gray added a comment -

          Note that the file does not currently address catalog tables as special in any way, though they are. I'm not quite there yet but will add it to the doc once that is fleshed out.

          Jean-Daniel Cryans added a comment -

          My first pass on the document; it would be nice if the doc was on rb.

          - Master startup determines whether this is startup or failover by counting the number of RS nodes in ZK.

          FWIW there's also a znode called "shutdown" that, if present, means that the cluster is up and running.
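
          A minimal sketch of that check with the ZooKeeper client; the parent path "/hbase" is an assumption here, not necessarily the actual layout:

          // Sketch only: path is assumed; a non-null Stat means the znode exists.
          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          public class ClusterUpCheck {
            public static boolean clusterAlreadyUp(ZooKeeper zk) throws Exception {
              Stat stat = zk.exists("/hbase/shutdown", false); // no watch needed for a one-off check
              return stat != null; // present => cluster is already up, so this is a failover
            }
          }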

          - Master waits for the minimum number of RS to be available to be assigned regions.

          I'd be interested to see your thoughts about determining the minimum number of RSs. Also how do you handle the assignment of ROOT and META? What's the chance that there's always enough RSs that reported in once both catalog regions are assigned?

          Periodically, and when there are not any regions in transition, a load
          balancer will run and move regions around to balance cluster load

          It's not clear to me why you have to wait. Then you wrote:

          - Load balancer blocks until there are no regions in transition.

          So once the load balancer is started, do we block regions going in transition?

          - Master sends RPCs to the source RSs, telling them to CLOSE the regions.

          What happens if a region splits while load balancing is running?

          - Master finds all regions of the table.

          Same question, what do you do if a region splits just after that? Won't you be missing 2 regions and giving a CLOSE rpc that's invalid?

          stack added a comment -

          .bq "Master waits for the minimum number of RS to be available to be assigned regions"

          How long you going to wait? Whats the trigger that has you start assigning?

          .bq "We assume that the Master will not fail until after the OFFLINE nodes have been created in ZK"

          And if it does? Is that for later?

          .bq "Load balancer blocks until there are no regions in transition"

          Doesn't just go back to sleep?

          .bq - Master sends RPCs to each RS, telling them to OPEN their regions.

          This is a new message?

          .bq Master removes the region from all in-memory structures

          Who updated .META.? The RS, right?

          .bq "3. Table Enable/Disable"

          This has to be better than what we currently have?

          .bq "Master determines which regions need to be handled."

          How?

          .bq Master sends RPCs to RSs to open all the regions.

          Whats this message look like? Its the new one mentioned above?

          Will it take time figuring what was on the dead RS? Or will it be near-instantaneous?

          .bq " 4. RegionServer Failure"

          I don't see mention of wait on RS WAL log splitting in here. Should it be mentioned?

          This is a nice doc. Jon. I see it sticking around long after master rewrite. We'll point fellas here when they are trying to figure how this all worky-poos.

          Jonathan Gray added a comment -

          I'd be interested to see your thoughts about determining the minimum number of RSs.

          How long you going to wait? Whats the trigger that has you start assigning?

          This is a configuration parameter that already exists. We could also add some kind of timeout (once one RS checks in, we wait N seconds before starting up).
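
          A sketch of that startup wait (block until the configured minimum number of regionservers has checked in, or until a timeout has elapsed since the first check-in); the ServerTracker type and the exact config handling are hypothetical:

          // Sketch of the startup wait described above. Config values and the
          // ServerTracker type are hypothetical, not existing HBase classes.
          public class StartupWait {
            public static void waitForRegionServers(ServerTracker tracker,
                int minServers, long timeoutAfterFirstMs) throws InterruptedException {
              Long firstCheckinAt = null;
              synchronized (tracker) {
                while (tracker.getOnlineServerCount() < minServers) {
                  if (firstCheckinAt == null && tracker.getOnlineServerCount() > 0) {
                    firstCheckinAt = System.currentTimeMillis();
                  }
                  if (firstCheckinAt != null
                      && System.currentTimeMillis() - firstCheckinAt >= timeoutAfterFirstMs) {
                    break; // proceed with whatever servers we have
                  }
                  tracker.wait(100); // expects tracker.notifyAll() on each RS check-in
                }
              }
            }

            /** Hypothetical tracker of regionserver znodes (e.g. children of an RS parent znode). */
            public interface ServerTracker {
              int getOnlineServerCount();
            }
          }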

          Also how do you handle the assignment of ROOT and META? What's the chance that there's always enough RSs that reported in once both catalog regions are assigned?

          As I said, this doc leaves out catalog tables for now. One thing to remember, startup is no longer going to need to wait around for the meta scanner to kick in and region assignment should be significantly faster, so it could be potentially sub-second to assign them out.

          So once the load balancer is started, do we block regions going in transition?

          Well sometimes you can't. The idea is that load balancing runs when the cluster is at steady state.

          [if regions in transition] Doesn't just go back to sleep?

          It could. That might be a simpler approach, except that if you set a long period you'd potentially skip a load balance because of a single region in transition.

          The load balancer will probably be the last thing I add, still some details to flesh out. I think it makes sense to not kick off a load balancing when there are already regions in transition (it makes sense as a background operation when the cluster is mostly at steady state from region movement standpoint). Whether it waits or skips the run is an open question, I'd opt for waiting. The next run of the load balancer would happen at the fixed period, maybe starting from the end of the previous run (rather than strictly periodic).
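
          For the "period measured from the end of the previous run" option, a fixed-delay schedule is the natural fit; a sketch, with balance() and regionsInTransition() as placeholders rather than real HBase calls:

          // Sketch only: measuring the balancer period from the end of the previous run
          // maps onto scheduleWithFixedDelay (vs. scheduleAtFixedRate).
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;

          public class BalancerChore {
            public static void start(ScheduledExecutorService pool, long periodMs) {
              pool.scheduleWithFixedDelay(() -> {
                // This version skips the run if anything is in transition; the comment
                // above leans toward waiting instead, which would block here.
                if (regionsInTransition() == 0) {
                  balance();
                }
              }, periodMs, periodMs, TimeUnit.MILLISECONDS);
            }

            private static int regionsInTransition() { return 0; } // placeholder
            private static void balance() { /* placeholder */ }
          }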

          What happens if a region splits while load balancing is running?

          Good question. Will try to address splits in the next revision. They are largely ignored in this one as is meta editing in general. Let me give that a crack.

          "We assume that the Master will not fail until after the OFFLINE nodes have been created in ZK" And if it does? Is that for later?

          Later, perhaps. It does not seem unreasonable to assume the Master won't fail during the first minute of the initial cluster startup, but I suppose it could eventually be handled. In any case, I think this is okay for now.

          Master sends RPCs to each RS, telling them to OPEN their regions. This is a new message?

          Yup. There'll be a new OPEN and CLOSE RPC to RS. In one of my earlier branches for this I moved cluster admin methods to be direct RPC to RS as well but that's outside the scope of this for now.

          Who updated .META.? The RS, right?

          That is the plan. I will address META editing and splits in the next revision of the doc.

          [table enable/disable] This has to be better than what we currently have?

          Yes, this should be better (hard to be worse). It's feasible to use ZK to make a synchronous client call as well (the client could look at the status of nodes in ZK to see the status of regions transitioning).
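
          One way such a synchronous client call could look, as a sketch only: the client polls the unassigned znodes until no regions of the table remain in transition. The "/hbase/unassigned" path and the assumption that the table can be recognized from the znode name are illustrative, not the actual layout:

          // Sketch only: polls the unassigned znodes until no region of the table is in
          // transition. Path and znode naming are assumptions.
          import java.util.List;
          import org.apache.zookeeper.ZooKeeper;

          public class DisableTableSync {
            public static void waitForTableQuiesced(ZooKeeper zk, String table,
                long pollMs, long timeoutMs) throws Exception {
              long deadline = System.currentTimeMillis() + timeoutMs;
              while (System.currentTimeMillis() < deadline) {
                List<String> inTransition = zk.getChildren("/hbase/unassigned", false);
                boolean anyLeft = inTransition.stream().anyMatch(n -> n.startsWith(table + ","));
                if (!anyLeft) {
                  return; // no regions of this table are transitioning any more
                }
                Thread.sleep(pollMs);
              }
              throw new RuntimeException("Timed out waiting on table " + table);
            }
          }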

          "Master determines which regions need to be handled." How?

          In the doc I describe that assignment information is sitting in-memory in the master, but I'm not positive we should use this since a failed-over master may not have it yet. I also left out what a failed-over master does to rebuild the assignments in-memory. The latter would be done with the meta scanner; the former could be as well.

          "Master sends RPCs to RSs to open all the regions." Whats this message look like? Its the new one mentioned above? Will it take time figuring what was on the dead RS? Or will it be near-instantaneous?

          If we have up-to-date in-memory state to work with, it's near-instantaneous. Meta scanning, not so much. It'd be nice to be able to use the in-memory info if it is available, and if it's not complete (we're a failed-over master that hasn't fully recovered yet) we'd fall back to meta scanning.
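
          A sketch of that fallback logic; all types and method names here are hypothetical:

          // Sketch only: use in-memory assignments when complete, else fall back to a meta scan.
          import java.io.IOException;
          import java.util.List;

          public class DeadServerRegions {
            public static List<RegionInfo> regionsOn(String deadServer,
                AssignmentSnapshot inMemory, MetaScanner metaScanner) throws IOException {
              if (inMemory != null && inMemory.isComplete()) {
                return inMemory.getRegionsOn(deadServer); // near-instantaneous
              }
              return metaScanner.scanRegionsAssignedTo(deadServer); // slower, but authoritative
            }

            interface AssignmentSnapshot {
              boolean isComplete();
              List<RegionInfo> getRegionsOn(String server);
            }

            interface MetaScanner {
              List<RegionInfo> scanRegionsAssignedTo(String server) throws IOException;
            }

            interface RegionInfo {}
          }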

          I don't see mention of wait on RS WAL log splitting in here. Should it be mentioned?

          Yup. Good call.

          Thanks for reviewing it guys. Not sure I'll have time this week/weekend to do a refresh but should have a revised one up sometime Monday.

          stack added a comment -

          Here's the patch I committed.

          Here's a bit of a change log:

          HBASE-2692 Master rewrite and cleanup for 0.90
          
          Patch brought over from 0.90_master_rewrite branch.
          
          Replication test is broke as are some of the rest tests.
          Others should be passing.
          
          Some of the changes made in this fat patch:
          
          + In HLogKey, we now use encoded region name instead of full region name.
          + On split, daughters are opened on the parent's regionserver; let the new balancer
          sort them out later when it cuts in.
          + Added move region from one server to another as well as enable/disable balancer.
          + All .META. and -ROOT- edits go via new *Editor and *Reader classes -- no more
          do we have 5 different ways of reading and editing .META.
          + Rather than 3 different listeners to hlog each w/ own way of listening, instead
          we only have WALObserver now.
          + New Server Interface that has whats common to HMaster and RegionServer. Also
          new Services Interface.  This should make test writing cleaner making it so
          less need of full cluster context testing anything -- e.g. the new
          Interfaces are good w/ Mockito.
          + New balancer that runs on a period and takes into consideration all load
          across cluster.
          + Table online/offline is now a flag in ZK; the offline flag on a region is
          just used for splitting from here on out.
          + Moved fixup of failed add of daughter edits to .META. into shutdown server
          recover code (It used to be in basescanner).
          + The heartbeat now sends master the regionserver load and is used sending
          shutdown message from master to regionserver ONLY; all other messages are
          via zk (HMsg is pretty bare now).
          + No more Worker in RS and ToDoQueue in master.  Both in master and regionserver
          we use handlers instead run out of Executors.
          + Client can now send split, flush, compact direct to RS; no longer does
          it go via master.
          + Server shutdown runs differently now. All are watching a flag in zk.
          When RS notices its gone, it closes all user-space regions. If thats all
          it was carrying, then it goes down.  Otherwise, waits on master to send
          the shutdown msg via heartbeat.
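
          As a sketch of the last item above (regionservers watching a cluster-up flag in ZK and starting shutdown when it disappears), with the znode path and the closeUserRegions() hook as assumptions for illustration:

          // Sketch only: the znode path and the closeUserRegions hook are assumptions.
          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class ClusterShutdownWatcher implements Watcher {
            private final ZooKeeper zk;
            private final Runnable closeUserRegions; // RS-supplied shutdown hook

            public ClusterShutdownWatcher(ZooKeeper zk, Runnable closeUserRegions) {
              this.zk = zk;
              this.closeUserRegions = closeUserRegions;
            }

            public void start() throws Exception {
              zk.exists("/hbase/shutdown", this); // sets a watch whether or not the node exists
            }

            @Override
            public void process(WatchedEvent event) {
              if (event.getType() == Event.EventType.NodeDeleted) {
                closeUserRegions.run(); // flag gone: close user-space regions, then exit or wait on master
              } else {
                try {
                  zk.exists("/hbase/shutdown", this); // watches are one-shot; re-set it
                } catch (Exception e) {
                  // ignored in this sketch
                }
              }
            }
          }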
          
          
          stack added a comment -

          Committed. Thanks for the work Jon and Karthik.

          Jonathan Gray added a comment -

          This is a list of JIRAs I compiled some time ago that are at least in part related to HBASE-2692 and the changes done in the master rewrite branch.

          HBASE-1816 Master rewrite
          HBASE-1750 Region opens and assignment running independent of shutdown processing
          HBASE-1439 race between master and regionserver after missed heartbeat
          HBASE-2405 Close, split, open of regions in RegionServer are run by a single thread only.
          HBASE-2465 HMaster should not contact each RS on startup
          HBASE-2464 Shutdown of master doesn't succeed if root isn't assigned
          HBASE-2391 TableServers.isMasterRunning() skips checking master status if master field is not null
          HBASE-2240 Balancer should not reassign (because of overloading) a region that was just opened
          HBASE-2238 Review all transitions – compactions, splits, region opens, log splitting – for crash-proofyness and atomicity
          HBASE-2235 Mechanism that would not have ROOT and .META. on same server caused failed assign of .META.
          HBASE-2228 Region close needs to be fast; e.g. if compacting, abandon it
          HBASE-1992 We give up reporting splits if .META. is offline or unreachable (handlers at their maximum)
          HBASE-1909 Balancer assigns to most loaded server when assigning a single region
          HBASE-1873 Race condition around HRegionServer -> HMaster communication
          HBASE-1851 Broken master failover
          HBASE-1755 Putting 'Meta' table into ZooKeeper
          HBASE-1746 Starting 5 Masters on fresh install, a few fail in race to create hbase.version
          HBASE-1742 Region lost (disabled) when ROOT offline or hosting server dies just before it tells master successful open
          HBASE-1720 Why we have hbase.master.lease.period still in our logs when HRS goes to ZK now?
          HBASE-1711 META not cleaned up after table deletion/truncation
          HBASE-1700 Regionserver should be parsimonious regards messages sent master
          HBASE-1676 load balancing on a large cluster doesn't work very well
          HBASE-1666 'safe mode' is broken; MetaScanner initial scan completes without scanning .META.
          HBASE-1502 Remove need for heartbeats in HBase
          HBASE-1463 ROOT reassignment provokes .META. reassignment though .META. is sitting pretty
          HBASE-1461 region assignment can become stale and master doesnt fix it
          HBASE-1451 Redo master management of state transitions coalescing and keeping transition state over in ZK
          HBASE-1419 Assignment and balance need to coordinate; assigner gives out up to 10 regions at a time unbalancing a balanced cluster making for churn, etc.
          HBASE-934 Assigning all regions to one server only
          HBASE-869 On split, if failure updating of .META., table subsequently broke
          HBASE-57 [hbase] Master should allocate regions to regionservers based upon data locality and rack awareness

          stack added a comment -

          Here is the generated HBase book with Jon's notes on region transitions (see chapter 9).

          stack added a comment -

          Thanks for the above list Jon. Helped. Most got closed. I left open the questionable ones and the issues where we need a unit test before we can close them.


            People

            • Assignee: Jonathan Gray
            • Reporter: Jonathan Gray
            • Votes: 0
            • Watchers: 3
