HBase / HBASE-2485

Persist Master in-memory state so on restart or failover, new instance can pick up where the old left off

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None

      Description

      Today there was some good discussion on IRC about how transitions won't always make it across Master failovers in a multi-master deployment: transitions are kept in an in-memory structure in the Master, so on master crash the new master will be missing state on startup. (Todd was the main promulgator of this observation and of the opinion that, while the master rewrite is scheduled for 0.21, some part needs to be done for 0.20.5.) A few suggestions were made, e.g. that transitions should be file-backed somehow. Let this issue be about the subset we want to do for 0.20.5.

      Of the in-memory state queues, there is at least the master tasks queue – process region opens, closes, regionserver crashes, etc. – where tasks must be done in order and, IIRC, tasks are fairly idempotent (at least in the server crash case, it's multi-step and we'll put the crash event back on the queue if we cannot do all steps in one go). Perhaps this queue could be done using the new queue facility in zk 3.3.0 (I haven't looked to check if that's possible, just suggesting). Another suggestion was a file to which we'd append queue items and requeues, marking tasks in the file as complete, etc. On Master restart or fail-over, we'd replay the queue log.
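
A minimal sketch of the append-and-replay file idea, just to make the suggestion concrete. Everything here is an assumption for illustration: the class `MasterTaskLog`, the `ENQ`/`DONE` record format, and the rule that a requeue moves a task to the back of the queue are all hypothetical, not anything that exists in HBase. It also assumes task ids contain no spaces and descriptions contain no newlines.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical append-only task log; not an HBase class. */
final class MasterTaskLog {
    private final Path file;

    MasterTaskLog(Path file) { this.file = file; }

    /** Creates a log backed by a fresh temporary file (handy for demos and tests). */
    static MasterTaskLog createTemp() {
        try {
            return new MasterTaskLog(Files.createTempFile("master-tasks", ".log"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Path file() { return file; }

    /** Appends an "ENQ" record; enqueueing an already-logged id acts as a requeue. */
    void enqueue(String taskId, String description) {
        append("ENQ " + taskId + " " + description);
    }

    /** Appends a "DONE" record marking the task complete. */
    void complete(String taskId) {
        append("DONE " + taskId);
    }

    /** Re-reads the whole file and returns descriptions of unfinished tasks in queue order. */
    List<String> replay() {
        Map<String, String> pending = new LinkedHashMap<>();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String[] parts = line.split(" ", 3);
                if (parts[0].equals("ENQ") && parts.length == 3) {
                    pending.remove(parts[1]);          // a requeue moves the task to the back
                    pending.put(parts[1], parts[2]);
                } else if (parts[0].equals("DONE") && parts.length == 2) {
                    pending.remove(parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(pending.values());
    }

    private void append(String line) {
        try {
            Files.write(file, (line + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A new master would construct `MasterTaskLog` over the same path and call `replay()` before accepting new work. Because the tasks are described above as fairly idempotent, replaying a task that had partly run before the crash should be tolerable; the sketch does not fsync or compact the file, which a real version would need to consider.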

      There is also the Map of regions-in-transition. Yesterday we learned that there is a bug where server shutdown processing does not iterate the Map of regions-in-transition. This Map may hold regions that are in "opening" or "opened" state but haven't yet had the fact added to .META. by the master. Meanwhile the hosting server can crash. Regions that were opening will stay in regions-in-transition, and those in opened-but-not-yet-added-to-meta will go ahead and add a crashed server to .META. (Currently regions-in-transition does not record the server on which the region open is happening, so it doesn't have enough info to be processed as part of server shutdown.)
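
To illustrate the missing piece, here is a rough sketch of a regions-in-transition map that records the hosting server, so that server shutdown processing has something to iterate. The class and method names are hypothetical, and the policy shown (pull both "opening" and "opened" regions of the crashed server out of the map and hand them back for reassignment) is an assumption about what a fix could do, not a description of the actual master code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; not the HBase master's actual data structure. */
final class RegionsInTransition {
    enum State { OPENING, OPENED }

    /** What the map would need to hold per region: the state and the hosting server. */
    private static final class Transition {
        final State state;
        final String server;

        Transition(State state, String server) {
            this.state = state;
            this.server = server;
        }
    }

    // Keyed by region name; a TreeMap is used only to keep iteration order predictable.
    private final Map<String, Transition> inTransition = new TreeMap<>();

    void put(String region, State state, String server) {
        inTransition.put(region, new Transition(state, server));
    }

    /** Called once the region's location has been written to .META. */
    void remove(String region) {
        inTransition.remove(region);
    }

    boolean contains(String region) {
        return inTransition.containsKey(region);
    }

    /** Returns the recorded state, or null if the region is not in transition. */
    State stateOf(String region) {
        Transition t = inTransition.get(region);
        return t == null ? null : t.state;
    }

    /**
     * Removes every region that was opening or opened on the crashed server
     * and returns their names so the caller can reassign them.
     */
    List<String> processServerShutdown(String crashedServer) {
        List<String> toReassign = new ArrayList<>();
        for (Map.Entry<String, Transition> e : inTransition.entrySet()) {
            if (e.getValue().server.equals(crashedServer)) {
                toReassign.add(e.getKey());
            }
        }
        inTransition.keySet().removeAll(toReassign);
        return toReassign;
    }
}
```

With the server recorded, an "opened" region whose host has since died is removed before anyone writes the dead server's address into .META., and an "opening" region no longer sits in the map forever.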

      Regions-in-transition also needs to be persistent. On startup, regions-in-transition can get kinda hectic on a big cluster. Ordering is not so important here, I believe. A directory in zk might work (for 1M regions in a big cluster, that'd be about 2M creates and 2M deletes during startup – is that too much?). Or we could again write a WAL-like log of region transitions (we'd have to develop a little vocabulary) that gets reread by a new master.
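
As a sketch of what the "little vocabulary" for a WAL-like transition log might look like, the example below replays records of the form `VERB region [server]` and rebuilds the set of regions still in transition. The verbs (`OPENING`, `OPENED`, `CLOSING`, `DONE`) and the record layout are invented for illustration only. Since ordering across regions is said not to matter much, the replay only cares about the last record seen for each region, and it skips anything it cannot parse, such as a record cut short by a crash.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical replay of a region-transition log; the vocabulary is made up for this sketch. */
final class TransitionLogReplay {
    /** DONE marks the end of a transition; the other verbs mark a transition in progress. */
    enum Verb { OPENING, OPENED, CLOSING, DONE }

    /** Returns region -> "VERB@server" for every region whose transition has not finished. */
    static Map<String, String> replay(List<String> records) {
        Map<String, String> inTransition = new TreeMap<>();
        for (String record : records) {
            String[] parts = record.trim().split("\\s+");
            if (parts.length < 2) {
                continue;                       // blank or truncated record
            }
            Verb verb;
            try {
                verb = Verb.valueOf(parts[0]);
            } catch (IllegalArgumentException e) {
                continue;                       // unknown or half-written verb
            }
            String region = parts[1];
            if (verb == Verb.DONE) {
                inTransition.remove(region);
            } else {
                String server = parts.length > 2 ? parts[2] : "?";
                inTransition.put(region, verb + "@" + server);
            }
        }
        return inTransition;
    }
}
```

A new master would call `TransitionLogReplay.replay(...)` on the lines of the log and seed its regions-in-transition from the result. Compared with one znode per region, this trades many small zk operations for sequential appends, at the cost of having to reread and eventually truncate the log.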

          Activity

          stack added a comment -

          Just to say that the attached doc is good on how things used to work. It also puts up a few simple axioms on how things are to be in the new master, with listings of general transition flows. The hbase 'book' has the committed versions of these flows. I also took from the doc the description of how splits now work in the new master.

          Jonathan Gray added a comment -

          Newly committed master failover unit tests all passing on hudson. Resolving!

          Jonathan Gray added a comment -

          Working on unit tests for this over the course of this week.

          stack added a comment -

          When we write the unit test to demonstrate this issue is fixed, be sure to include coverage for the case described over in HBASE-1742, where an open comes in while the master is down and it causes loss of region onlining.

          stack added a comment -

          Yes. This needs a test and we need to add Karthik's document to the hbase book before we can close this issue I'd say.

          Jonathan Gray added a comment -

          Implementation of this moved into HBASE-2692. We should add some tests before closing this perhaps.

          stack added a comment -

          Bulk move of 0.20.5 issues into 0.21.0 after vote that we merge branch into TRUNK up on list.

          Karthik Ranganathan added a comment -

          Hey Stack,

          Excellent feedback, thanks!

          1. Will add a modified doc soon, absolutely agree with your comment:
          "intent is moving the intransitions out of Master to zk so intransitions weather a master restart."
          2. wrt Master restarts: jgray and I were discussing, it will be a scheme similar to zk/UNASSIGNED, but in a different location. And a RS will be handed a bunch of regions using one zk node update, and will have to ack the bulk open in one zk node update. Will fill in once the details are clearer, but it will not follow the exact same scheme.
          3. Yes, closing is of no use.
          4. Agreed
          5. Yes, opening is a nice to have. I am taking the following approach: let the RS report opening progress, but master will ignore them for the first cut.
          6. No, I don't think that would be needed in the current scheme. The RS would just update the state of the region to "OPENED" and master can infer from there.

          We have already started coding some parts, will update once there is more progress...

          stack added a comment -

          Doc is great. Here's a few comments.

          + I think you should start your proposal w/ some high-level intents: e.g. only messages from Master to RS over RPC are of import, are "commands"; messages from RS to Master are just informational (load, split). OR, intent is moving the intransitions out of Master to zk so intransitions weather a master restart.
          + Startup could be tricky. Here we are hoisting all regions in .META. up into the unassigned in zk. I was wondering about the case where the copy from .META. to zk/UNASSIGNED is only partially done say because master crashes. What happens? Maybe it'll be OK? If the meta startcode does not match that of a running regionserver, then the region has not yet been assigned so add it to zk/UNASSIGNED.
          + In Close Region RS Flow, did we agree closing is of no use? There is nothing master can do really if closing is taking forever?
          + Up in zk, unfortunately, znodes will have to be named using the region's encoded name. Will make it a little tough following region flow. Perhaps the fix is to make the encoded name of a region more prevalent in logs.
          + We said opening was nice to have rather than necessary?
          + I wonder if you need a new message from Master to RS where you can ask the RS what regions it has deployed? Be best if we didn't need it. We shouldn't need it I suppose.
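
On the partial-copy question in the startup bullet above, a small sketch of the startcode rule may help show why a crash midway could be OK: if the rule is applied as a pure function of .META. and the set of live regionservers, a new master can simply run it again and add whatever is still missing. The types and names below (`MetaRow`, `regionsToUnassign`) are hypothetical, and .META. rows are reduced to just the fields the rule needs.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the startup scan; not actual HBase master code. */
final class StartupUnassignedScan {
    /** A .META. row reduced to what the rule needs; server is null if never assigned. */
    static final class MetaRow {
        final String encodedName;
        final String server;
        final long startCode;

        MetaRow(String encodedName, String server, long startCode) {
            this.encodedName = encodedName;
            this.server = server;
            this.startCode = startCode;
        }
    }

    /**
     * Returns encoded names of regions that still need to be added to zk/UNASSIGNED:
     * those whose .META. startcode does not match a running regionserver and that
     * are not already listed as unassigned.
     */
    static Set<String> regionsToUnassign(List<MetaRow> metaRows,
                                         Map<String, Long> liveServerStartCodes,
                                         Set<String> alreadyUnassigned) {
        Set<String> toAdd = new TreeSet<>();
        for (MetaRow row : metaRows) {
            boolean hostedByLiveServer = row.server != null
                    && Long.valueOf(row.startCode).equals(liveServerStartCodes.get(row.server));
            if (!hostedByLiveServer && !alreadyUnassigned.contains(row.encodedName)) {
                toAdd.add(row.encodedName);
            }
        }
        return toAdd;
    }
}
```

If the first master crashed after adding only some regions, the second pass sees those in `alreadyUnassigned` and returns just the remainder. This assumes nothing else changes .META. or the live-server set between the two passes, which a real design would have to guard against.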

          Karthik Ranganathan added a comment -

          Adding an initial design of how to handle region transitions.


            People

            • Assignee: Jonathan Gray
            • Reporter: stack
            • Votes: 0
            • Watchers: 7