HBase / HBASE-5487

Generic framework for Master-coordinated tasks

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.94.0
    • Fix Version/s: None
    • Component/s: master, regionserver, Zookeeper
    • Labels:
      None

      Description

      Need a framework to execute master-coordinated tasks in a fault-tolerant manner.

      Master-coordinated tasks such as online-scheme change and delete-range (deleting region(s) based on start/end key) can make use of this framework.

      The advantages of the framework are:
      1. Eliminate repeated code in the Master, ZooKeeper trackers, and region servers for master-coordinated tasks
      2. Ability to abstract the common functions across Master -> ZK and RS -> ZK
      3. Easy to plug in new master-coordinated tasks without adding code to core components (a rough interface sketch follows below)
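      To make the plug-in idea concrete, a master-coordinated task could be modeled as a set of idempotent steps that the framework persists and drives to completion. The following is only a rough sketch under that assumption; the names (MasterTask, TaskContext, TaskStore) are hypothetical and not part of any existing HBase API.

          import java.io.IOException;

          /** Hypothetical pluggable master-coordinated task; all names are illustrative only. */
          public interface MasterTask {
            /** Short unique name used to route persisted state back to this task type. */
            String getName();

            /**
             * Execute (or re-execute) the next idempotent step. Must be safe to call
             * again after a partial failure, e.g. after a master failover.
             * @return true if the whole task is complete, false if more steps remain.
             */
            boolean executeStep(TaskContext ctx) throws IOException;

            /** Roll back whatever the task has done so far; also idempotent. */
            void rollback(TaskContext ctx) throws IOException;
          }

          /** Hypothetical handle to the master-side services a task needs (ZK, RS RPC, file system, ...). */
          interface TaskContext { }

          /** Hypothetical persistence hook; could be backed by ZK, a master WAL, or a system table. */
          interface TaskStore {
            long submit(MasterTask task) throws IOException;            // durably record the task
            void recordStepDone(long taskId, int step) throws IOException;
            void markComplete(long taskId) throws IOException;          // lets persisted state be cleaned up
          }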

      Attachments

      1. Entity management in Master - part 1.pdf
        138 kB
        Sergey Shelukhin
      2. Entity management in Master - part 1.pdf
        125 kB
        Sergey Shelukhin
      3. hbckMasterV2b-long.pdf
        549 kB
        Jonathan Hsieh
      4. hbckMasterV2-long.pdf
        546 kB
        Jonathan Hsieh
      5. Is the FATE of Assignment Manager FATE.pdf
        225 kB
        Jeffrey Zhong
      6. Region management in Master.pdf
        134 kB
        Sergey Shelukhin
      7. Region management in Master5.docx
        208 kB
        Sergey Shelukhin

        Issue Links

          Activity

          stack added a comment -

           I took a look at FATE over in accumulo. It's some nice generic primitives for running a suite of idempotent operations (even if an operation only part completes, if it's run again, it should clean up and continue). There is a notion of locking on a table (so can stop it transiting I suppose; there are read/write locks), a stack for operations (ops are pushed and popped off the stack), operations can respond done, failed, or even w/ a new set of operations to do first (this basic mechanism can be used to step through a number of tasks one after the other). All is persisted up in zk run by the master; if the master dies, a new master can pick up the half-done task and finish it. Clients can watch zk to see if a task is done. There ain't too much to the fate package; there is the fate class itself, an admin, and a 'store' interface of which there is a zk implementation. We should for sure take inspiration at least from the work already done.

          Here are the ops they do via fate:

                    fate.seedTransaction(opid, new TraceRepo<Master>(new CreateTable(c.user, tableName, timeType, options)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new RenameTable(tableId, oldTableName, newTableName)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new CloneTable(c.user, srcTableId, tableName, propertiesToSet, propertiesToExclude)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new DeleteTable(tableId)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new ChangeTableState(tableId, TableOperation.ONLINE)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new ChangeTableState(tableId, TableOperation.OFFLINE)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new TableRangeOp(MergeInfo.Operation.MERGE, tableId, startRow, endRow)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new TableRangeOp(MergeInfo.Operation.DELETE, tableId, startRow, endRow)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new BulkImport(tableId, dir, failDir, setTime)), autoCleanup);
                    fate.seedTransaction(opid, new TraceRepo<Master>(new CompactRange(tableId, startRow, endRow)), autoCleanup);
          
          

           CompactRange is their term for merge. It takes a key range span, figures the tablets involved and runs the compact/merge. We want that, and then something to do the removal of regions too?
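           To give a feel for the shape of those primitives, here is a simplified sketch of a FATE-style repeatable operation, modeled loosely on what the comment above describes (illustrative only; the actual Accumulo Repo interface may differ in names and signatures):

               import java.io.Serializable;

               /**
                * Simplified sketch of a FATE-style repeatable operation ("repo").
                * Each step is idempotent; returning a new Repo pushes the next step
                * onto the persisted stack, returning null means the operation is done.
                */
               public interface Repo<T> extends Serializable {
                 /** Non-zero means "not ready yet, retry later" (e.g. waiting on a table lock). */
                 long isReady(long txid, T environment) throws Exception;

                 /** Perform this step; must be safe to re-run after a partial failure. */
                 Repo<T> call(long txid, T environment) throws Exception;

                 /** Undo this step if the overall transaction is aborted. */
                 void undo(long txid, T environment) throws Exception;
               }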

          Keith Turner added a comment -

           I am an Accumulo developer. CompactRange is our operation to force a range of tablets (regions) to major compact all of their files into one file. The TableRangeOp will merge a range of tablets into one tablet. TableRangeOp can also delete a range of rows from a table efficiently. It inserts split points at the rows you want to delete, drops the tablets, and then merges what's left.
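           In outline, that delete-range trick could look roughly like the following (illustrative pseudocode only, not Accumulo's actual code; the helper names are made up):

               // Fence the range with split points, drop the fully contained tablets,
               // then merge the remaining neighbors back together.
               void deleteRange(String table, byte[] startRow, byte[] endRow) throws Exception {
                 addSplit(table, startRow);                         // tablet boundary at startRow
                 addSplit(table, endRow);                           // tablet boundary at endRow
                 dropTabletsFullyWithin(table, startRow, endRow);   // delete that range wholesale
                 merge(table, startRow, endRow);                    // stitch the edges back together
               }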

          Keith Turner added a comment -

           The description of FATE given in this ticket is pretty good. The following resources may provide a little more info.

           http://people.apache.org/~kturner/accumulo14_15.pdf
           http://mail-archives.apache.org/mod_mbox/incubator-accumulo-dev/201202.mbox/%3CCAGUtCHpcHTDue-C_2RyDkZm0diW=Zojd7-BzCGsZQDTiDznZqg@mail.gmail.com%3E
          Enis Soztutar added a comment -

           Keith, thanks a lot for the refs; we badly need something like FATE.
           Mubarak, do you or anybody else plan to work on this?

          Mubarak Seyed added a comment -

          @Enis
          I am not working on this right now. Thanks.

          Keith Turner added a comment -

           Accumulo 1.3 cannot survive running our random walk test w/ the agitator (a perl script that kills accumulo processes; it's not as devious as Todd's gremlins).

          Random walk + Agitation + Accumulo 1.3 == foobar

          Attempting the above would leave Accumulo in an inconsistent state (like corrupted metadata table) or test clients would die with unexpected exceptions.

           My point is that while developing FATE it was nice to have Random Walk + Agitation to really beat up the FATE framework and the FATE table operations. We also wrote some new random walk tests for 1.4 that were even meaner.

          Keith Turner added a comment -

           To add context to my above comment, Accumulo 1.3 does not have FATE; it was introduced in 1.4.

          Nick Dimiduk added a comment -

          I'll take a stab at this.

          Andrew Purtell added a comment -

          Brownie points if coprocessors can use it too.

          Jonathan Hsieh added a comment -

          Not sure how this had the noob tag --this sounds like a fairly major undertaking!

          Nick Dimiduk added a comment -

          Andrew Purtell Any chance you could describe in more detail what you have in mind? Say, 250 words, as part of a dependent JIRA, perhaps?

          Jonathan Hsieh We shall see!

          Andrew Purtell added a comment -

          Any chance you could describe in more detail what you have in mind?

           Coprocessor "applications" that are running a MasterObserver or MasterEndpoint are very likely going to want to take advantage of a generic framework for master-coordinated tasks, so whatever API there is for this should be available via MasterServices, I'd say.

          Nick Dimiduk added a comment -

          https://reviews.apache.org/r/8732/

          Here's an initial hack to introduce Fate. I'll start from here on implementing some Master operations. Bring on the comments

          Ted Yu added a comment -

          @Nick:
          Please specify hbase for the Groups. This way people would get notified when new comments are made.

          conf/accumulo-site.xml seems to be empty. Did you want to put anything there ?

          Nick Dimiduk added a comment -

          Ted Yu

          Please specify hbase for the Groups.

           I intentionally omitted the HBase group because this is not a real patch ready for real review. I simply want feedback from the people directly interested here. It's really a hack to bolt the accumulo code into HMaster; I would -1 it.

          conf/accumulo-site.xml seems to be empty. Did you want to put anything there ?

          Oops, that was a local symlink. It will run w/o the site.xml file, it just logs a warning. I'll address this in my next patch.

          Enis Soztutar added a comment -

           Nick Dimiduk I guess you won't be working on this for some time, but wanna chime in with the latest status?

           Thinking about the use case, what we want is to ensure that client operations outlive master failover. Which is why in accumulo/fate, the state is kept in zk, and the master just provides execution. I think we can achieve the same thing if we add a WAL for the master. Again, we have to break up the operation (like create table) into idempotent pieces, and sync the WAL before executing them. On master failover we just have to replay the WAL. Not sure which one would be simpler though.
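           A rough sketch of that idea, purely for illustration (MasterWal, StepRecord and Op are hypothetical names, not existing HBase classes): each step is synced to the WAL before it runs, and a completion record is synced when the whole operation is done.

               import java.io.IOException;
               import java.util.List;

               // Hypothetical sketch of driving one master operation through a master WAL.
               void runOperation(MasterWal wal, long opId, List<Op> idempotentSteps) throws IOException {
                 for (int i = 0; i < idempotentSteps.size(); i++) {
                   wal.append(StepRecord.started(opId, i));
                   wal.sync();                        // durable before executing the step
                   idempotentSteps.get(i).execute();  // must be safe to re-run after a crash
                   wal.append(StepRecord.done(opId, i));
                   wal.sync();
                 }
                 wal.append(StepRecord.completed(opId));
                 wal.sync();                          // operation fully done; its WAL entries can be released
               }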

          Nick Dimiduk added a comment -

           I cannot comment on a WAL implementation vs this FATE stuff. Before I left on holiday, I had a FateManager integrated into HMaster and mostly ported over CreateTable as a FATE repo. The next task is to extract a Fate client and push it out to the client HBaseAdmin. I can make it a priority to clean up my pending code and post another review, particularly if someone else is chomping to pick up the FATE PoC.

          Please advise.

          Enis Soztutar added a comment -

           One more benefit of going WAL can be that you can replicate master operations (if you want to) to a remote cluster. I am checking the HLog infrastructure right now; it does not look too intrusive to add an hlog for the master. Of course there would be some work for log splitting (or lack thereof), replaying, rolling, and serializing the operation steps as WALEdits.

          stack added a comment -

          FATE would violate one of our "Invariants", no permanent data in zk: See http://hbase.apache.org/book.html#developing 15.10.4. We could change the 'invariant' (smile) or we could do the Enis suggestion of a master WAL. That seems like a nice idea. How would we delete items from the WAL when done? Write a 'done' record? Let a WAL go when all of its tasks are 'DONE'?

          Nick Dimiduk added a comment -

          FATE's approach of serializing the steps along the way is easily replicated via WAL. Write a step to the WAL to initiate, then write the next step, and the next, &c as you go. When the whole thing is complete, write a completion record.

          The WAL approach means a client cannot initiate tasks though, correct?

          Jesse Yates added a comment -

           I think using the WAL is not the ideal mechanism; it makes master failover a very heavy-weight operation (requires WAL replay) and it makes it really hard to reason about what changes should be replicated to the target clusters and which shouldn't (maybe one cluster mirrors prod, but the other stores 2x history).

          We already require ZK to be up and store an awful amount of state up there to the point where we are hurting if we don't have it (particularly replication).

           An idea that Chris Trezzo just proposed (and I really like) is to move this to a 'system level table'. Since we don't lose data in HBase, we can be sure it survives master failover, while having close to zero MTTR and maintaining our current invariants. The only problem is the added complexity.

          Sergey Shelukhin added a comment -

           I'd question the transient-ZK invariant; it seems that if we got rid of it we could do away with lots of complexity (and maybe an entire class of things like META).
          However, if we were to keep it I wonder if hybrid ZK-WAL approach makes sense.
          ZK doesn't lose data I presume, so in most cases failover is ZK only. If ZK state is wiped, replay WAL.

          Jonathan Hsieh added a comment - - edited

           Why not write to a system table (like META) using the existing WAL? This question came up for snapshot clone/restore recovery, and is a potential solution for maintaining table enable/disable state (which is in zk and violates the rule), and possibly for replication state (also in zk, violating the rule).

          Devaraj Das added a comment -

           Having a system table to store all state that we need to survive across master restarts makes sense to me. That would avoid having the master rebuild state when it resumes operation (assuming that we don't store 'permanent' state in ZK). But we'd then have two mechanisms to store state - ZK and the system table. We'd need to be careful about what state goes where and all that.
          The other thing that comes to my mind is what'd happen if ZK went down while a cluster is operational. Is that an invariant? Something that must never happen?

          Sergey Shelukhin added a comment -

          If we want ZK to always be running then system table is no more reliable than ZK (from the perspective of uptime).

          Enis Soztutar added a comment -

          FATE would violate one of our "Invariants", no permanent data in zk:

          I think it depends on the definition of permanent. FATE keeps the stack of operation states in zk, but when the whole thing finishes, no zk data is left. You can still wipe out zk, if you are in a clean state.

          FATE's approach of serializing the steps along the way is easily replicated via WAL. Write a step to the WAL to initiate, then write the next step, and the next, &c as you go. When the whole thing is complete, write a completion record.

          Agreed.

          I think using the WAL is not the ideal mechanism; it makes master failover a very heavy-weight operation

           Not sure why it is so heavy weight. It requires replaying only the non-finished steps of the operation, or undoing the operation, so that we can get back to a clean state. As in fate, we can divide every operation (create table, disable, etc) into a sequence of steps, which are idempotent, and undo/redo'able. Before the operation, and before every step, the master syncs the step to the WAL; after the step is done, the master does another sync to record a DONE state for that step. On replay, it just buffers non-finished operations (ignores already finished ones), and undoes/redoes the steps not finished.
           We can do size-based and periodic log rolling, and for every non-finished operation keep the smallest seqNum. When the operation is done, we can just release that seqId, so that we can use the same fshlog infrastructure.
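           For illustration, the replay rule described above might look roughly like this (StepRecord and OperationState are hypothetical types; a sketch, not a concrete proposal):

               import java.util.HashMap;
               import java.util.Map;

               // Hypothetical replay on master failover: completed operations are dropped,
               // unfinished ones are returned so the caller can redo (or undo) their remaining steps.
               Map<Long, OperationState> replay(Iterable<StepRecord> walRecords) {
                 Map<Long, OperationState> pending = new HashMap<>();
                 for (StepRecord rec : walRecords) {
                   OperationState op = pending.computeIfAbsent(rec.opId(), id -> new OperationState());
                   switch (rec.type()) {
                     case STEP_STARTED: op.stepStarted(rec.step()); break;
                     case STEP_DONE:    op.stepDone(rec.step());    break;
                     case OP_COMPLETED: pending.remove(rec.opId()); break;  // nothing left to redo
                   }
                 }
                 return pending;
               }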

          An idea that Chris Trezzo just proposed (and I really like) is to move this to a 'system level table'.

           I must have missed this. This will still be a WAL I guess, but kept as an hbase table. Considering Megastore also keeps its WAL in Bigtable, I think it is doable.

          Why not write to a system table, (like META) using the existing wal? This question came up for snapshot clone/restore recovery, and is a potential solution for maintaining table enable/disable state (which is in zk and violates the rule)

           I never liked the table enabled/disabled state, and agree that it is a violation, as well as a major cause of bugs. I guess, regardless of where we keep it (fshlog or system table), it is clear that we need a WAL.

          Jonathan Hsieh added a comment -

           Another argument for having all this state in a system table/meta (and thus in HDFS), and not having permanent state in ZK, is that if you are doing something like "backup" of hbase you can do it just by copying data from HDFS. Administratively you could use the future hdfs snapshot feature or just copy the files from one cluster to another and have a decent chance of recovering (likely still painful, but with a good chance of success). Adding an extra step of having to copy data from one ZK instance to another sounds more complicated and even more error prone.

          Sergey Shelukhin added a comment -

           My main point is that currently two external region states, in ZK and a table, plus two complex internal states in server and master, are the root of some non-trivial part of all evil. Especially the nature of ZK state, which comes and goes. Imho we should remove one of them from being actively managed.
          ZK has notifications and seems better suited for locking/atomic updates; w.r.t. availability it has no disadvantage since everything (e.g. locating the root) fails without ZK anyway, even if we do remove state machines from there.
          System tables are more native to HBase and have built-in WAL, plus have advantages for recovery.

           Maybe instead of WAL we can use ZK as the universal source of region state (w/o assorted transient nodes, e.g. one node per region that is always there, or maybe two if we want to use a lock with lease to unassign) and mirror it to a system table that is only used for recovery like you describe, or when ZK state disappears?
           Otherwise I think we should just use a system table as the universal source of region state and get rid of ZK region state.
           With one source of truth, master and server logic can probably be dumber.

          Jonathan Hsieh added a comment -

          My main point is that currently two external region states in ZK and a table, plus two complex internal states in server and master, are a root of some non-trivial part of all evil. Especially the nature of ZK state, that comes and goes. Imho we should remove one of them from being actively managed.

          ZK has notifications and seems better suited for locking/atomic updates; w.r.t. availability it has no disadvantage since everything (e.g. locating the root) fails without ZK anyway, even if we do remove state machines from there.

          System tables are more native to HBase and have built-in WAL, plus have advantages for recovery.

           ZK notifications are useful when there are two-way communications – log splitting, region server initiated splits. I do agree that opens and closes seem more complicated than necessary.

          Maybe instead of WAL we can use ZK as universal source of region state (w/o assorted transient nodes e.g. one node per region that is always there, or maybe two if we want to use lock with lease to unassign) and mirror it to system table that is only used for recovery like you describe, or when ZK state disappears?

          Otherwise I think we should just use system table as universal source of region state and get rid of ZK region state.

          With one source of truth master and server logic can probably be dumber.

          Simpler, not dumber.

          The 0.20/0.89 version of the master actually had most things in meta – and there are definitely some trade-offs with that approach. In the transition to the 0.90 style master we traded some pain points for new ones. If we change this again we need to make sure we keep those previous ones in mind to not duplicate the worst of them again.

           Is there any chance we could get a high level design deck/doc that illustrates these processes currently and what they look like after we move to this proposed FATE-like mechanism? Also, what operations would eventually get ported to this mechanism? I think discussion and an example at the design/rpc comms level would help a whole lot by grounding this conversion in reality and not require diving into the code. Once we basically agree on design, code reviews would be easier because they'd be focused on the implementation matching the design.

          Nick Dimiduk added a comment -

           I've pushed my FATE WIP to github so you guys can see what a fate repo might look like for us. I appear to have introduced a bug while porting over the logic, but the general idea is there.

          Nick Dimiduk added a comment -

          Removing myself from assignment as my priorities have shifted.

          Enis Soztutar added a comment -

          Is there any chance we could get a high level design deck/doc that illustrates these processes currently and what looks like after we move to this proposed FATE-like mechanism? Also, what operations would eventually get ported to this mechanism? I think discussion and an example at the design/rpc comms level would help a whole lot by grounding this conversion in reality and not require diving into the code. Once we basically agree on design, code reviews would be easier because they'd be focused on the implementation matching the design.

           Agreed that before doing anything, we should do a design doc, and agree on that. I think what we are discussing here is pre-design discussion.

          Jonathan Hsieh added a comment -

           I don't mean to stymie progress here, but I want to understand how the semantics of this will work in comparison to what exists currently. Even in these pre-design steps, and since Nick has already done some preliminary work, a distilled example in pseudo code or just the rpc/storage interactions would be helpful. Generally a little bit of effort there is easier than going through a few thousand lines of code...

          Sergey Shelukhin added a comment -

           I have some different priorities currently, but if I have time I will try to do a short write-up on the ZK + backup table based approach. Maybe towards the eow...

          Sergey Shelukhin added a comment -

           Just checking; is this issue still relevant? I am particularly interested in an all-encompassing solution for region management that would allow you to read the AM and not go insane (I was just reading/debugging the 0.94 AM for some time, that's why I remembered). I have a sketch of an idea, should I write it up?

          Enis Soztutar added a comment -

           Yes, please write it up. Although AM-related things were not initially the main focus.

          Jonathan Hsieh added a comment -

          +1 Please do a write up – the standard stuff – what problems exist (op1 and op2 clashes, etc), the proposed fix, and how it would fix it.

          Sergey Shelukhin added a comment -

          Attaching v0 write-up for region management.

          Sergey Shelukhin added a comment -

          Any opinion? Thanks. I think the same model should be used for all tables too, but it's less necessary at this time...

          Jonathan Hsieh added a comment -

           Thanks for writing this up. I read the first two sections and haven't spent time reading the details of the others yet (already have a bunch of questions).

           So the only problem is that the assignment manager and ssh are difficult to reason about? Is the assignment manager the only "master coordinated" task in scope?

          I think there are more problems than that that we should enumerate:

          • Instead of asserting it is not clear if table (+region) locks scale, let's find out.
           • Master operations and processes can clash and we should understand where we need concurrency control. (I'm working on a table – here's a draft distilled version [1]; there exists an overly detailed version that I'll share once I get it fixed)
          • Should there be a notion of queuing operations? (locking, or an actual queue) Should these operations be generically logged so they can complete if a master goes down in the middle? (ex: master goes down during a "move" operation after the close but before the open on the new rs).

          The "design principles" is actually more of a proposed design.

          Design principles::region record

           • how do we deal with operations where we need "locks" on multiple regions because we are reading or modifying multiple regions – e.g. splits, merges, snapshots? Matteo Bertozzi had suggested in another jira making a meta row per table, or maybe part of the solution is using the multi-row single meta region transaction.

           What are the alternatives? Why this approach vs others?

          [1] https://docs.google.com/spreadsheet/ccc?key=0AiVCAt6zRttFdDRVaG56RnpQZlNNZUNsRVJsSXM3YlE&usp=sharing

          Jimmy Xiang added a comment -

          Where do you think the new information will be, META table?

          Sergey Shelukhin added a comment -

          Is the assignment manager the only "master coordinated" task in scope?

          Only for the current document version... tables could be added.

          Instead of asserting it is not clear if table (+region) locks scale, let's find out.

           Hmm... that would require implementing region locks, and having a very large cluster. I am talking more about unacceptable blocking of user operations, and management of expiring locks in the presence of real-life failures.

          Master operations and processes can clash and we should understand where we need concurrency control. (I'm working on a table – here's an draft distilled version [1], there exists an overly detailed version that I'll share once i get it fixed)

          Comments below.

          Should there be a notion of queuing operations? (locking, or an actual queue) Should these operations be generically logged so they can complete if a master goes down in the middle? (ex: master goes down during a "move" operation after the close but before the open on the new rs).

          You mean like WAL for operations?

          The "design principles" is actually more of a proposed design.

          Yeah, sorry, wanted to split it into two sections but never did. Will rename.

          how do we deal with operations where we need "locks" on multiple region because we are reading or modifying multiple regions – e.g. splits, merges, snapshots? Matteo Bertozzi had suggested in another jira making a the meta row per table, or maybe part of the solution is using the multi-row single meta region transaction.

          Depends on where we store it, but yeah these have to be transactional. Last section (very short ) suggests using ZK, which already supports that.

          What are alternatives? why this approach vs others?

          I can expand the doc... the implicitly mentioned existing alternatives are locks, which I would argue scale less and are harder to manage; or transaction approach that is currently used (although not unified), for example via transient transaction nodes.

           Actually, one alternative approach I saw used for such things is to simplify concurrency of operations/etc. with an actor-like model, where the master has the logical cluster state and a previously saved target state, and periodically (often) takes an epic lock, looks at them quickly, and based on what it is doing, outputs a new target cluster state and a list of physical things to do. Then it releases the epic lock, the new target state is saved, and the operations are performed.
          That way all state-management code becomes simple, because it runs in one place with no concurrency, and recovery just has to compare real cluster state with destination state.
          But this will require thinking about this differently.
          Also usually that would mean RSes won't be able to initiate operations (like split) - they will have to go thru master (which I would argue is ok).
          Also it's not clear whether this will become too much of a bottleneck.
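           Very roughly, such a reconciliation loop might be shaped like the following (a sketch only; epicLock, ClusterState, Action and the helper methods are all hypothetical names):

               // Hypothetical actor-style master loop: state management happens single-threaded
               // under one lock, while the resulting RPCs/IO run outside the lock.
               void reconcileOnce() throws IOException {
                 ClusterState observed;   // what region servers actually report
                 ClusterState target;     // desired state, durably saved
                 List<Action> actions;    // physical things to do (open/close region, etc.)

                 epicLock.lock();
                 try {
                   observed = collectCurrentState();
                   target = computeNewTargetState(observed, pendingRequests);
                   actions = diff(observed, target);   // e.g. "close region R on RS1, open on RS2"
                   persistTargetState(target);         // durable before releasing the lock
                 } finally {
                   epicLock.unlock();
                 }
                 for (Action a : actions) {
                   a.execute();   // idempotent; recovery just re-diffs observed vs target state
                 }
               }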

          Where do you think the new information will be, META table?

          It seems to me that ZK would be better (see last section), but META is also an option.

          From the spreadsheet:

          Enabling and disabling table operations should be blocked when any of these simple region operations are in progress

          Not clear why (logically).

          move

          Move is close and open, doesn't require consistency, right?

          Regionserver Processes ... However, the individual operations must maintain the table integrity property.

          Not clear what this means for snapshots.

          Jonathan Hsieh added a comment -

          To do a major overhaul, we need something stronger than "the code is hard to read". I agree that it is hard to follow (see: http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf) but it seems to be basically working which is a pretty strong argument. Let's compare and point out what is wrong/broken in the current implementation and how the new design won't have those problems.

          The spreadsheet link is my first step to enumerating semantics and distilling the set of possible problems and things that are being guarded from races. Any major-overhaul solution should make sure that these operations, when issued concurrently, interact according to a sane set of semantics in the face of failures.

          Only for the current document version... tables could be added

           So I buy open/close as a region operation. Split/merge are multi-region operations – is there enough state to recover from a failure?

          So alter table is a region operation? Why isn't it in the state machine?

          Hmm... that would require implementing region locks, and having a very large cluster. I am talking more about unacceptable blocking of user operations, and management of expiring locks in presense of real-life failures.

           Implementing region locks is too far – I'm asking for some back-of-the-napkin discussion. I think we need some measurements of how much throughput we can get in ZK or with a ZK-lock implementation and compare this with # of RS watchers * # of regions * number of ops.

           The current regions-in-transition (RIT) code basically assumes that an absent znode is either closed or opened. RIT znodes are present when the region is in the in-between states (opening, closing, etc.).

          You mean like WAL for operations?

          Yeah, we could call it an "intent" log. It would have info so that a promoted backup master can look in one place and complete an operation started by the downed original master.

          ... Also usually that would mean RSes won't be able to initiate operations (like split) - they will have to go thru master (which I would argue is ok).

           I know I've suggested something like this before. Currently the RS initiates a split, and does the region open/meta changes. If there are errors, at some point the master side detects a timeout. An alternative would have splits initiated on the RS but have the master do some kind of atomic changes to meta and region state for the 3 involved regions (parent, daughter a and daughter b).

          Depends on where we store it, but yeah these have to be transactional. Last section (very short ) suggests using ZK, which already supports that.

           We need to be careful about ZK – since it is a network connection also, exceptions could be failures or timeouts (which succeeded but weren't able to ack). It would help if we can describe the properties (durable vs erasable) and assumptions (if the wipeable ZK is the source of truth, how do we make sure the version state is recoverable without time travel?).

          Jonathan Hsieh added a comment -

          I'll deal with the spreadsheet-related comments by putting it somewhere that comments can be easily dropped into.

          Enis Soztutar added a comment -

          Yeah, we could call it an "intent" log. It would have info so that a promoted backup master can look in one place and complete an operation started by the downed original master.

          This was what I proposed in an earlier comment:
          https://issues.apache.org/jira/browse/HBASE-5487?focusedCommentId=13551519&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13551519
          What we need is a transactional, authoritative, fault-tolerant, and durable source of ground truth about the cluster state and execution state. Whether we do it using ZK, a master WAL, or a system table (which gives us an implicit WAL) is something we will have to figure out.

          Sergey Shelukhin added a comment -

          Jonathan Hsieh Will reply in detail later/tomorrow; one clarification - the justification was not just "code is hard to understand", but "code is hard to reason about AND we want to expand it to support a bunch of features". I think the whole idea of a general framework, whatever it is, is to have some unified model of things and a way of doing these manipulations and expanding their set, that is not patchwork.

          Enis Soztutar I think finding one is the least of our problems. How to use it to store/manage state is the question.

          Sergey Shelukhin added a comment -

          Any major-overhaul solution should make sure that these operations, when issued concurrently, interact according to a sane set of semantics in the face of failures.

          This is another (although not orthogonal) question.
          I am looking for a sane way to define and enforce arbitrary semantics first. Then sane semantics can be enforced on top of that.
          For example, the "actor-ish" model below would make it easy to write simple code; persistent state would make sure there's a definite state at any time, and all crucial transitions are atomic, so semantics would be easy to enforce as long as the code can handle a failed transition/recovery. Locks also make this simple, although locks have other problems imho.
          Although we can go both ways, if we define sane semantics it would be easy to see how convenient they are to implement in a particular model.

          So I buy open/close as a region operation. split/merge are multi region operations – is there enough state to recover from a failure?

          There should be. Can you elaborate?

          So alter table is a region operation? Why isn't it in the state machine?

          Alter table is currently an operation that involves region operations, namely open/close. Open/close are in the state machine. As for tables, I am not sure a state machine is the best model for table state; there isn't that much going on with a table that is properly an exclusive state.

          Implementing region locks is too far – I'm asking for some back of the napkin discussion.

          If a server holds a lock for a region for time Tlock during each day, and the number of regions is N, the probability of some region lock (or table read-only lock) being held at any given time is (1-(1-(Tlock/Tday))^N), if I am writing this correctly. For 5 seconds of locking per day per region, and 10000 regions (not unreasonable for a large table/cluster), we will be holding some lock about 44% of the time for region operations.
          Calculating the probability of any lock being in recovery (the server went down with a lock less than recovery time ago) can also be done, but numbers for some parameters (how often do servers go down?) will be very speculative...
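
          As a sanity check of the 44% figure, a quick back-of-the-napkin computation (plain Java, nothing HBase-specific; the inputs are the same assumed numbers as above):

            public class LockContention {
              public static void main(String[] args) {
                double tLock = 5.0;      // seconds each region's lock is held per day
                double tDay = 86400.0;   // seconds in a day
                int regions = 10000;
                // P(at least one of N independent locks is held at a random instant)
                double p = 1.0 - Math.pow(1.0 - tLock / tDay, regions);
                System.out.printf("P(some lock held) = %.2f%n", p); // prints ~0.44
              }
            }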

          I think we need some measurements of how much throughput we can get in ZK or with a ZK-lock implementation, and compare this with # of RS watchers * # of regions * number of ops...

          Will there be many watchers/ops? You only watch and do ops when you acquire the lock, so unless region operations are very frequent...

          The current regions-in-transition (RIT) code basically assumes that an absent znode is either closed or opened. RIT znodes are present when the region is in the in-between states (opening, closing, ...)

          I don't think "either closed or opened" is good enough. Also, RITs don't cover all scenarios, and things like table ops don't use them at all.

          I know I've suggested something like this before. Currently the RS initiates a split, and does the region open/meta changes. If there are errors, at some point the master side detects a timeout. An alternative would have splits initiated on the RS but have the master do some kind of atomic changes to meta and region state for the 3 involved regions (parent, daughter a and daughter b).

          Yeah, although in other models (locks, persistent state) that is not required. Also, if meta is a cache for clients and not the source of truth, meta changes can still be made on the server; I assume by meta you mean global state, wherever that is?

          We need to be careful about ZK – since it is a network connection also, exceptions could be failures or timeouts (which succeeded but weren't able to ack). It would help to describe the properties (durable vs erasable) and assumptions (if the wipeable ZK is the source of truth, how do we make sure the version state is recoverable without time travel?).

          The former applies to any distributed state; as for the latter - I was thinking of ZK+"WAL" if we intend to keep ZK wipeable.

          stack added a comment -

          Was just looking at a test failure issue. A move operation was failing because the region we were asking it to move had not fully opened yet, so the move just failed:

          2013-07-11 09:35:00,489 DEBUG [RpcServer.handler=4,port=55346] master.AssignmentManager(2277): Attempting to unassign ephemeral,,1373535299969.ab63d8b7c5339b4a61fdd70e8cb8993a. but it is already in transition (OPEN, force=false)

          Client needs a means of asking what happened to its move.

          Client should also be able to say I want the move to succeed..... in the above case then, the move op should be retried.

          Need a queue of outstanding ops. Need to be able to query it for where is my op? (like FedEx: where is my package)

          Enis Soztutar added a comment -

          The requirements are pretty much clear. We need a persistent set of ops running in the cluster. Every op is defined as a stack of steps which are undoable/redoable and idempotent. The client submits an op and gets a request_id, with which it can query the state of the operation later. The ops and steps are handled by state machines + stacked execution. The state is persisted via ZK, a WAL for the master, or an HBase table (which is just a higher-level log).
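
          A minimal sketch of what these requirements could look like as interfaces, loosely modeled on Accumulo's FATE Repo/store split; every name below is hypothetical, not existing HBase code:

            enum OpStatus { NEW, IN_PROGRESS, FAILED, DONE }

            /** One idempotent, undoable/redoable step of a larger op. */
            interface Step {
              boolean isReady() throws Exception;  // safe to run now (e.g. lock held)?
              Step call() throws Exception;        // do the work; return the next step, or null when done
              void undo() throws Exception;        // roll back a partially applied step
            }

            /** Durable stack-of-steps store; ZK, a master WAL, or a system table could sit underneath. */
            interface OpStore {
              long create();                       // allocate a new op id
              void push(long opId, Step step);     // persist the next step on the op's stack
              Step top(long opId);                 // current step; survives master failover
              void pop(long opId);                 // mark the current step durably complete
              OpStatus status(long opId);
            }

            /** Client-facing surface: submit returns the request_id that can be polled later. */
            interface OpRunner {
              long submit(Step initial);
              OpStatus query(long requestId);      // "where is my op?"
            }

          A promoted backup master would then just reopen the OpStore, find ops still IN_PROGRESS, and resume or undo them from top().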

          stack added a comment -

          The requirements are pretty much clear...

          Smile. Yeah. I wonder what is 'next'?

          Enis Soztutar added a comment -

          Once the dust settles for 0.96, I think this is a good candidate for 0.98.

          Sergey Shelukhin added a comment -

          Do these requirements cover the test failure just mentioned... or any others that come from collisions of multiple ops on the same region?

          stack added a comment -

          Enis Soztutar Agree
          Sergey Shelukhin No. I think the move 'failing' is not too bad; it is something we can work on, perhaps by having two types of move... a move "recommendation" and then a "required" move. The offense in my scenario above is that the move failed silently.

          Sergey Shelukhin added a comment -

          I would like to resurrect this issue (and rename it). It seems that every fix to the assignment manager introduces 1-3 more maps or lists around it, makes it even more impossible to comprehend, and may add more bugs.

          We have talked a little bit here, and agreed on 2 key points.
          1) Region state should be managed in one permanent place with one state machine; no separate and/or transient state machines, no operation-based state machines.
          2) The new assignment manager should be easily testable by simulating sequences of events.
          I think my doc above is still a reasonably good approximation, but of course we might need to discuss and flesh out the details.

          Ahem. I said new assignment manager, that is because I would like to rename this jira "rewrite assignment manager".
          Wdyt? Enis Soztutar Jimmy Xiang Stack Nick Dimiduk Jonathan Hsieh

          stack added a comment -

          We have talked a little bit here, and agreed on 2 key points.

          What about Enis's 'requirements'? We agree on his list? That makes 3 key points. I think we should add a fourth point where we rehearse what is wrong w/ the current system (Jon's suggestion), as it will help ensure we don't repeat the mistakes of the past.

          Make a subtask 'New Assignment Manager'? Or make this a subtask of a new issue called 'New Assignment Manager'. An issue named so will be easier to find than this one. Also, others are interested in this effort (@honghua and Liang Xie) and it'll catch their attention.

          I think it is too big a change to be done for the pending 0.98. Let's not rush it. It could even land post HBase 1.0 if 0.98 is to become 1.0.

          On the design doc.,

          + Doc needs author and date. I would expect a section situating the document – context – that at least referred to the current 'design' – see https://issues.apache.org/jira/browse/HBASE-2485 (it has a 'state' machine that looks like this one)
          + The problem section is too short (state kept in multiple places and all have to agree...); we need a fuller list so we can be sure the proposal addresses them all
          + How is the proposal different from what we currently have? I see us tying regionstate to table state. That is new. But the rest, where we have a record and it is atomically changed, looks like our RegionState in Master memory? There is an increasing 'version' which should help ensure a 'direction' for change, which should help.
          + A single atomically mutable record of the regionstate is well and good, but how then to get the cluster to align w/ what this record says? For example, the table record says it is disabled. It has 10k regions. How do we get the regions to agree w/ the Table record which says it is disabled? We can send the closes, but how are we sure the close happened on all 10k regions?
          + I don't get this bit: "This record is the only source of truth about the region, and is never removed while the region is relevant. This simplifies current situation where ZK state, master state and META state can all conflict in various special ways..." It's fine having a source of truth, but ain't the hard part bringing the system along? (meta edits, clients, etc.)

          Experience has zk as messy to reason with. It is also an indirection having RS and M go to zk to do 'state'.

          Thanks for writing this up Sergey

          stack added a comment -

          Jimmy Xiang What you think of Sergey Shelukhin doc?

          Jimmy Xiang added a comment -

          The doc is very good. I like the Problem and Design Principles sections. The region state machine can be enhanced. We should have a single source of truth which is scalable and performs well. I think it could be the master (in memory). All actions (including split/merge) should be started and managed by the master.
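
          As a sketch only (it assumes nothing about current internals), the single source of truth could be as small as one versioned record per region with one state machine around it; the version gives changes a direction, so a stale actor's transition is rejected:

            import java.util.concurrent.atomic.AtomicReference;

            enum RegionState { OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, SPLIT }

            final class RegionRecord {
              static final class Versioned {
                final RegionState state;
                final long version;
                Versioned(RegionState state, long version) { this.state = state; this.version = version; }
              }

              private final AtomicReference<Versioned> current =
                  new AtomicReference<Versioned>(new Versioned(RegionState.OFFLINE, 0));

              /** Atomically move from 'expected' to 'next'; false means someone else transitioned first. */
              boolean transition(RegionState expected, RegionState next) {
                Versioned cur = current.get();
                if (cur.state != expected) {
                  return false;
                }
                // A real implementation would persist the new record (meta/WAL/ZK)
                // before acknowledging the transition to anyone.
                return current.compareAndSet(cur, new Versioned(next, cur.version + 1));
              }
            }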

          Nicolas Liochon added a comment -

          3 comments:

          • I wonder if we should use "something" that would allow us to test all the possible states. At least, we should make this really testable, without needing to set up a ZK, a set of RSs and so on.
          • We should question the master-based architecture. How does it work for the MapR implementation, for example? Why is the assignment manager not in the region server holding meta? This would save one distributed state, for example.
          • I really really really think that we need to put performance as a requirement for any implementation. For example, something like: on a cluster with 5 racks of 20 regionservers each, with 200 regions per RS, the assignment will be completed in 1s if we lose one rack. I saw a reference to async ZK in the doc; it's great, because the performance is 10 times better.

          Thanks for writing the doc Sergey.

          stack added a comment -

          + Agree that master should become a lib that any regionserver can run.
          + Agree testable but would like to point out that it is possible to put up our current AM in a standalone mode – it just takes mockery (smile)
          + Agree on perf. Helps MTTR.

          Jimmy Xiang added a comment -

          Agree that master should hold the meta region. It should hold other system table regions as well.

          Jesse Yates added a comment -

          -1 that it should hold other system tables. We know META isn't going to span more than a region, but it would be completely reasonable for other system tables to be larger (e.g. statistics). Maybe worth considering a single-region flag for certain tables to identify that they can never be split and can support single-region transactions.

          Jimmy Xiang added a comment -

          That's right. For big system tables which are not required for the system to start/run, they can be assigned somewhere else. The master doesn't have to hold all system table regions.

          Devaraj Das added a comment -

          We should have a single source of truth which is scalable and performs well. I think it could be the master (in memory). All actions (including split/merge) should be started and managed by the master.

          I agree with this. We have done this in other components - HDFS, MapReduce, Yarn, etc.... Depending on ZK for the state management brings complexity. I think we should use ZK for only ephemeral stuff and not for storing state there. In that regard, using ZK for discovering lost RSs is fine, but not for storing the region states. I like the idea of the Master WAL to take care of master crashes.

          Sergey Shelukhin added a comment -

          we still need a reliable store (ZK, system table, or master WAL). It seems ZK is the most scalable and best suited for the task. In a perfect world we would have a ZK library that we could host, and have a quorum of masters running Paxos/ZK. But we don't have that...

          Elliott Clark added a comment -

          we still need a reliable store

          HBase is a reliable store. We should be using it as such for current state.

          If we co-locate the master process with meta, then the master noticing state changes is as simple as loading a co-processor that hooks mutations. It also means that when the master wants to look up current state there's no rpc overhead. Simply target the hregion. This allows us to reduce the number of copies of state. No longer will we need a local hash map + what's in zk + what's in meta.

          I think Jimmy's correct we should use zk for ephemeral only. Everything else should be in our systems.
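
          For illustration, roughly what the "co-processor that hooks mutations" idea looks like, sketched against the 0.96-era RegionObserver API; the class and its body are hypothetical, not an existing observer:

            import java.io.IOException;
            import org.apache.hadoop.hbase.client.Durability;
            import org.apache.hadoop.hbase.client.Put;
            import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
            import org.apache.hadoop.hbase.coprocessor.ObserverContext;
            import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
            import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
            import org.apache.hadoop.hbase.util.Bytes;

            /** Loaded on the meta region; lets a co-located master observe state mutations in-process. */
            public class MetaStateObserver extends BaseRegionObserver {
              @Override
              public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                  Put put, WALEdit edit, Durability durability) throws IOException {
                // The mutation has already been applied to the meta region here, so
                // whatever the master does next sees the durable state, not a copy.
                String regionRow = Bytes.toStringBinary(put.getRow());
                // e.g. notify the in-process assignment logic that regionRow changed state.
              }
            }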

          Sergey Shelukhin added a comment -

          Please let's not use coprocessors for mainline functionality... also, if we store state in system table that is hosted by master, then we don't need ZK at all, we should get rid of it.
          The only disadvantages of using ZK that I see are the absence of a getKeyBefore/After API (easy to fix by having an ephemeral META table for clients to query), and having an extra moving part. If we don't get rid of ZK we don't alleviate the latter, so I think we should either use it for "everything" or not at all... I would prefer to use it for everything.
          As far as I see, ZK is more reliable than an HBase RS or master, has built-in replication with faster recovery, is probably more scalable than reading from a single RS, and has a better model for atomic state changes. It probably has better tolerance for stuff like network partitioning too. We could do a master WAL and all that stuff, but I don't see a compelling reason to do this when we have a bunch of Apache code that is already written to solve all of these problems.
          What is the reason to not use ZK? What is the advantage of system table, or disadvantage of ZK?

          Elliott Clark added a comment -

          Please let's not use coprocessors for mainline functionality

          We already do. I don't see anything wrong with making HBase more modular. If there are pain points with using co-processors that cause you to say no, then we should fix those. Not just ignore them.

          also, if we store state in system table that is hosted by master, then we don't need ZK at all, we should get rid of it.

          We don't have ephemeral node capability at all. And we need it for the bootstrap problem. It allows clients to point at a relatively small number of nodes to discover the whole cluster.

          As far as I see, ZK is more reliable than HBase RS or master

          Our master is only complex because of our use of zk to hold and mutate state.

          has built-in replication with faster recovery

          With the meta/system wal I think we can be within an order of magnitude.

          Sergey Shelukhin added a comment -

          Wrt coprocs - that is bad imho; that is not the kind of modular that we want. Core parts of the system should depend on well-defined interfaces, not generic extension points. Imho, the litmus test for a coproc, as a plugin interface, is - can you run HBase without it? If yes, then it's ok to be a coproc (e.g. accesscontrol). Otherwise we should have proper interfaces that have some meaning to the caller.

          Our master is only complex because of our use of zk to hold and mutate state.

          That is not due to ZK as such, that is due to multi-state-machine reconciliation model and truth in multiple places that it requires.
          A system table can have the exact same problem of state in the table + state in memory; the question is how you split and manage state between them, the storage substrate doesn't matter as much. If the truth were in ZK and nowhere else, that wouldn't be a problem, same as with a system table.
          Also, by reliable I meant that ZK is multiple nodes with built-in master recovery by design, whereas with the master you need at least HA, and it's still probably worse than ZK in case of failure.
          There are also other things that I mentioned.

          With the meta/system wal I think we can be within an order of magnitude.

          So, why would we write a bunch of new code to get "within an order of magnitude"? I don't see an advantage, or a ZK disadvantage that you mention, compared to the multiple advantages of ZK.
          Esp. if we cannot totally get rid of it, so we'll have an extra service regardless.

          Sergey Shelukhin added a comment -

          Btw I agree that the main point is to get rid of the complexity you mention (and in the doc I only mention the storage mechanism in ZK in one paragraph at the end), so the storage mechanism choice is almost orthogonal.
          But as far as it is concerned, it seems an obvious choice to me ATM to use ZK. I may not know something about ZK (or system tables?), but so far the pattern is that meta recovery is a big deal even without bugs, and with ZK we barely ever have any problems.

          Elliott Clark added a comment -

          Wrt coprocs - that is bad imho, that is not the kind of modular that we want.

          I'm not tied to it being a co-proc. But it does illustrate the idea that it can be done by watching mutations as they come into the normal hregion call stack.

          That is not due to ZK as such, that is due to multi-state-machine reconciliation model and truth in multiple places that it requires.

          In part it's due to getting zk messages out of order, and getting them delayed. Those pains are due in no small part to zk's client being single threaded.

          System table can have exact same problem of state in the table + state in memory, question is how you split and manage state between them, storage substrate doesn't matter as much.

          But you only have the one state if you have the master inside of the region server hosting meta. There's no need to have a map of assignment if meta is actually just a function call away. Also, the same is not true at all if you want to put state into zk. Then you need a local cache if you want to make this performant at all (that's how we got to the current state). Putting state into zk necessitates a split brain problem. There's what the master sees and what the outside world sees.

          So, why would we write a bunch of new code to get "within an order of magnitude"?

          That code is already there, and in use. We fail over meta right now in 240ms. I was commenting on what you were saying that zk fails over faster. And that's true but for meta we've narrowed that gap significantly. So I don't think that ZK has that much of an advantage.

          I don't see an advantage, or ZK disadvantage that you mention compared to multiple advantages of ZK

          We've tried putting state into zk. That failed. I really don't want to put a whole bunch of new code into hbase that does almost exactly the same thing as we currently have. It's going to fail.

          so the storage mechanism choice is almost orthogonal.

          For me it's not just about the storage. It's that co-locating storage with the master means that these split brain problems are much rarer.

          Sergey Shelukhin added a comment -

          We've tried putting state into zk. That failed. I really don't want to put a whole bunch of new code into hbase that does almost exactly the same thing as we currently have. It's going to fail.

          As I said I don't think that is true. Our problem is not state being in ZK; it is that the state is in multiple places in ZK itself for different parts of the same region's state, plus some state in master to reconcile these, plus some state that is not in ZK but only in master, plus also meta.
          I.e. not the split between ZK and master but split logical state within both and between them.

          Then you need a local cache if you want to make this performant at all (That's how we got to the current state).

          Our current state is not a local cache, it's a bunch of actual state...

          I am not yet sure how bad ZK-master split brain problem will be if ZK has entire truth, let me think about it.
          When you say no split-brain inside master, do you mean master will host the meta and do all reads and writes to meta with no local intermediate state in memory?

          stack added a comment -

          Our problem is not state being in ZK; it is that the state is in multiple places in ZK itself for different parts of the same region's state, plus some state in master to reconcile these, plus some state that is not in ZK but only in master, plus also meta.

          Master is the Actor. Having it go across a network to get/set the 'state' in a service that is non-transactional wasn't our smartest move.

          Regionservers currently report state via ZK. Master reads it from ZK. Would be better if the RS just reported directly to the master.

          Devaraj Das added a comment -

          we still need a reliable store (ZK, system table, or master WAL). It seems ZK is the most scalable and best suited for the task

          Sergey Shelukhin, not ZK, IMHO. Let's use one of our internal storages rather than an external system for storing the region state. I am all for removing ZK altogether from HBase. One less distributed system to worry about. One less component to manage. We already have heartbeats from RSs to the master, and region open/close RPCs from the master to the RSs. I think we have enough communication already in place between the master and RSs to deal with region states.... We also have chores in the master that try to take some actions based on assignment timeouts...

          Would this model work (conceptually)? It's late night here; please pardon me if there are glaring issues. Please bear with me.

          All region state manipulation operations are initiated by the master and they act upon the meta region. We have extra columns to store the state of the region etc in the meta table. The initial rows are created by the master and the state of the regions are UNASSIGNED. This is not new - we already do this but IIRC we don't store the state of the region. Some state transitions happen through method executions and some of those method executions are RPCs from the master to some regionserver. I think that the states would be more granular here (to prevent potential replay/repetitions of large operations). I am wondering whether it makes sense to update the meta table from the various regionservers on the region state changes or go via the master.. But maybe the master doesn't need to be a bottleneck if possible. A regionserver could first update the meta table, and then just notify the master that a certain transition was done; the master could initiate the next transition (Elliott Clark comment about coprocessor can probably be made to apply in this context). Only when a state change is recorded in meta, the operation is considered successful.

          Also, there is a chore (probably enhance catalog-janitor) in the master that periodically goes over the meta table and restarts (along with some diagnostics; probing regionservers in question etc.) failed/stuck state transitions. This chore runs once as soon as the master is started and the meta region is assigned to take care of transitions that were started in the previous life of the master and which are now waiting for some action from the master. For example, if the state was OPENING for a certain region, and the master crashed, the master would send a openRegion RPC to the region assignee upon restart. The region assignee would have been recorded as a column in the row for the region by the previous master.

          I think we should also save the operations that was initiated by the client on the master (either in WAL or in some system table) so that the master doesn't lose track of those and can execute them in the face of crashes & restarts. For example, if the user had sent a 'split region' operation and the master crashed.
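
          As a hedged sketch of the "extra columns in meta" part of this model, using the 0.96-era client API; the "state" and "assignee" qualifiers are hypothetical, not existing schema:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.hbase.HBaseConfiguration;
            import org.apache.hadoop.hbase.HConstants;
            import org.apache.hadoop.hbase.TableName;
            import org.apache.hadoop.hbase.client.HTable;
            import org.apache.hadoop.hbase.client.Put;
            import org.apache.hadoop.hbase.util.Bytes;

            public class MetaStateWriter {
              /** The transition is only considered started once this write succeeds. */
              public static void markOpening(byte[] regionRow, String assignee) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable meta = new HTable(conf, TableName.META_TABLE_NAME);
                try {
                  Put p = new Put(regionRow);
                  p.add(HConstants.CATALOG_FAMILY, Bytes.toBytes("state"), Bytes.toBytes("OPENING"));
                  p.add(HConstants.CATALOG_FAMILY, Bytes.toBytes("assignee"), Bytes.toBytes(assignee));
                  meta.put(p);
                } finally {
                  meta.close();
                }
              }
            }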

          Nicolas Liochon added a comment -

          zk vs. non zk.

          ZK is used in HDFS HA, no? So either way we have it in our architecture. Then using it for permanent data is another discussion (stuff like ZOOKEEPER-1147 makes it interesting).
          I would personally prefer to remove the master rather than adding functions to it. Saying that there are some specific threads in the region servers holding .meta. is acceptable imho.

          Jonathan Hsieh added a comment -

          ZK is used for the selection of the primary NN (via the failover controllers), but I believe the journal nodes (that do the durable consensus logging) do not use ZK at all. Todd Lipcon or Aaron T. Myers can confirm.

          Devaraj Das added a comment -

          Removing the separate master daemon is fine by me, Nicolas. However, we still need someone to do various operations (servicing user requests and other janitorial tasks). Long back we were discussing that a random region server (elected via zk) could perform the master role.

          Jimmy Xiang added a comment -

          I prefer not to use ZK since it's kind of the root cause of uncertainty: has the master/region server got/processed the event? has the znode been hijacked since the master/region server changed its mind?

          We should store the state in the meta table, which is cached in memory.

          Whether to use a coprocessor is not a big concern to me. If we don't use a coprocessor, I prefer to use the master as the proxy to do all meta table updates. Otherwise, we need to listen to something for updates.

          We should not have another janitor/chore. If an action failed, it must be because of something unrecoverable by itself, not because of a bug in our code. It should stay failed until the issue is resolved.

          We need to have something like FATE in accumulo to queue/retry actions taking several steps like split/merge/move.

          It is a nice-to-have to keep a history of region state transition.

          Nicolas Liochon added a comment -

          However, we still need someone to do various operations (servicing user requests and other janitorial tasks).

          Yeah, I agree. Balance is a good example. Let's say that I'm more comfortable w/ something that lowers the role of the master than the opposite.

          I prefer not to use ZK

          When you say this, Jimmy, do you mean "no ZK in HBase at all", or "No ZK for permanent data", or "No ZK at all for assignment"?

          We should store the state in the meta table, which is cached in memory.

          I'm fine with that (if we can make it work)

          Hide
          Jimmy Xiang added a comment -

          Nicolas, I mean no ZK for assignment.

          Hide
          Sergey Shelukhin added a comment -

          A big response to recent comments that haven't been responded to yet.
          Let me update the doc, EOW-ish probably, depending on the number of bugs surfacing.

          stack
          Let's keep discussion and doc here and branch tasks out for rewrites.

          + The problem section is too short (state kept in multiple places and all have to agree...); need more full list so can be sure proposal addresses them all

          What level of detail do you have in mind? It's not a bug fix, so I cannot really say "merge races with snapshot", or something like that; that could arguably also be resolved by another 100k patch to the existing AM.

          + How is the proposal different from what we currently have? I see us tying regionstate to table state. That is new. But the rest, where we have a record and it is atomically changed looks like our RegionState in Master memory? There is an increasing 'version' which should help ensure a 'direction' for change which should help.

          See the design principles (and the discussion below). We are trying to avoid multiple flavors of split-brain state.

          Its fine having a source of truth but ain't the hard part bring the system along? (meta edits, clients, etc.).

          Yes

          Experience has zk as messy to reason with. It is also an indirection having RS and M go to zk to do 'state'.

          I think ZK got a bad reputation not on its own merit, but on how we use it.
          I can see that problems exist but IMHO advantages outweigh the disadvantages compared to system table.
          Co-located system table, I am not so sure, but so far there's not even a high-level design for this (for example - do all splits have to go through the master/system table now? how does it recover? etc.).
          Perhaps we should abstract an async persistence mechanism sufficiently and then decide. Whether it would be ZK+notifications, or system table, or memory + wal, or colocated system table, or what.
          The problem is that the usage inside master of that interface would depend on perf characteristics.
          Anyway, we can work out the state transitions/concurrency/recovery without tying 100% to a particular store.
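
          A minimal sketch of what such a persistence abstraction might look like, assuming a CAS-style primitive underneath (all names here are hypothetical, for illustration only, not an existing HBase API):

            import java.io.IOException;

            // Hypothetical abstraction over the state store (ZK + notifications, system table,
            // memory + WAL, or a colocated system table); names are made up for illustration.
            public interface RegionStateStore {
              enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, FAILED }

              final class StateAndVersion {
                public final State state;
                public final long version;   // monotonically increasing, gives changes a 'direction'
                public StateAndVersion(State state, long version) { this.state = state; this.version = version; }
              }

              // Atomically transition an entity (region/table) if the stored version still matches;
              // this is the CAS primitive every backing store would have to provide.
              boolean compareAndTransition(String entityId, State next, long expectedVersion) throws IOException;

              // Read the current state and version; used by a new master to rebuild its view on failover.
              StateAndVersion get(String entityId) throws IOException;
            }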

          + Agree that master should become a lib that any regionserver can run.

          That sounds possible.

          Nicolas Liochon

          At least, we should make this really testable, without needing to set up a zk, a set of rs and so on.

          +1, see my comment above.

          I really really really think that we need to put performances as a requirement for any implementation. For example, something like: on a cluster with 5 racks of 20 regionservers each, with 200 regions per RS, the assignment will be completed in 1s if we lose one rack. I saw a reference to async ZK in the doc, it's great, because the performances are 10 times better.

          We can measure and improve, but I am not really sure what the exact numbers will be at this stage (we don't even know what the storage is).

          Devaraj Das

          A regionserver could first update the meta table, and then just notify the master that a certain transition was done; the master could initiate the next transition (Elliott Clark comment about coprocessor can probably be made to apply in this context). Only when a state change is recorded in meta, the operation is considered successful.

          Split, for example, requires several changes to meta. Will the master be able to see them together from the hook? If the master is collocated on the same RS as meta, a master RPC should be a small overhead.

          Also, there is a chore (probably enhance catalog-janitor) in the master that periodically goes over the meta table and restarts (along with some diagnostics; probing regionservers in question etc.) failed/stuck state transitions.

          +1 on that. Transition states can indicate the start ts, and master will know when they started.

          I think we should also save the operations that was initiated by the client on the master (either in WAL or in some system table) so that the master doesn't lose track of those and can execute them in the face of crashes & restarts. For example, if the user had sent a 'split region' operation and the master crashed

          Yeah, "disable table" or "move region" are a good example. Probably we'd need ZK/system table/WAL for ongoing logical operations.

          Jimmy Xiang

          We should not have another janitor/chore. If an action fails, it must be because of something unrecoverable by itself, not because of a bug in our code. It should stay failed until the issue is resolved.

          I think the failures meant here are things like the RS went away, or is slow or buggy, so OPENING got stuck - someone needs to pick it up after a timeout.

          We need to have something like FATE in Accumulo to queue/retry actions that take several steps, like split/merge/move.

          We basically need something that allows atomic state changes. HBase or ZK or mem+WAL would fit the bill.

          Hide
          Sergey Shelukhin added a comment -

          btw, any input on actor model?
          Things queue up operations/notifications ("ops") for the master; "AM" runs on a timer or when the queue is non-empty, taking as inputs the cluster state (incl. ongoing internal actions it ordered before, e.g. the OPENING state for a region) plus new ops from the queue, on a single thread; it generates new actions (not physically doing anything, e.g. talking to an RS); the ops state and cluster state are persisted; then the actions are executed on different threads (e.g. messages sent to RSes, etc.), and "AM" runs again, or sleeps for some time if the ops queue is empty.

          That is a different model, not sure if it scales for large clusters.
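
          Purely to illustrate the model being described, a rough sketch of the single-threaded loop might look like this (all types and names are hypothetical, not existing HBase classes):

            import java.util.List;
            import java.util.concurrent.BlockingQueue;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.TimeUnit;

            // Sketch of the actor-style AM loop: one thread consumes queued ops, computes new
            // actions against the persisted cluster state, and hands the actions off to worker
            // threads that do the actual RPCs.
            class AssignmentActor implements Runnable {
              interface Op { }
              interface Action { void execute(); }
              interface ClusterState { List<Action> apply(Op op); }
              interface StateStore { void persist(ClusterState state, Op lastOp); }

              private final BlockingQueue<Op> ops;       // operations/notifications queued for the master
              private final ClusterState state;          // includes ongoing internal actions (e.g. OPENING)
              private final StateStore store;            // persists ops state + cluster state
              private final ExecutorService workers;     // executes actions (RPCs to RSes) off the actor thread

              AssignmentActor(BlockingQueue<Op> ops, ClusterState state, StateStore store, ExecutorService workers) {
                this.ops = ops; this.state = state; this.store = store; this.workers = workers;
              }

              @Override
              public void run() {
                while (!Thread.currentThread().isInterrupted()) {
                  try {
                    // Wake up when an op arrives, or on a timer tick so stuck transitions get re-evaluated.
                    Op op = ops.poll(10, TimeUnit.SECONDS);   // op may be null on a timer tick
                    List<Action> actions = state.apply(op);   // pure decision step, no RS I/O here
                    store.persist(state, op);                 // persist before acting, so a new master can resume
                    for (Action a : actions) {
                      workers.submit(a::execute);             // actual RS messages happen off the actor thread
                    }
                  } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                  }
                }
              }
            }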

          Hide
          Honghua Feng added a comment -

          Master is the Actor. Having it go across a network to get/set the 'state' in a service that is non-transactional wasn't our smartest move.

          Regionservers currently report state via ZK. Master reads it from ZK. Would be better if the RS just reported directly to the Master.
          stack Yes, this is exactly what I proposed in HBASE-9726

          I am wondering whether it makes sense to update the meta table from the various regionservers on the region state changes or go via the master. But maybe the master doesn't need to be a bottleneck if possible. A regionserver could first update the meta table, and then just notify the master that a certain transition was done; the master could initiate the next transition

          Devaraj Das It would be better to let the master update the meta table rather than let the various regionservers do it. The master being the single actor and truth-maintainer can avoid many tricky bugs/problems. And for frequent state changes to the meta table, the regionserver serving the (state) meta table would become the bottleneck sooner than the master which issues the update requests, so it doesn't matter whether the update requests come from the master or from the various regionservers.

          I prefer not to use ZK since it's kind of the root cause of uncertainty: has the master/region server gotten/processed the event? Has the znode been hijacked because the master/region server changed its mind?

          We should store the state in the meta table, which is cached in memory.
          Whether to use a coprocessor is not a big concern to me. If we don't use a coprocessor, I prefer to use the master as the proxy for all meta table updates. Otherwise, we need to listen to something for updates.
          Jimmy Xiang Agree. IMO ZK alone is not the root cause of uncertainty; the current usage pattern of ZK is the root cause. The pattern where the regionserver updates state in ZK and the master listens to ZK and updates the states in its local memory accordingly exhibits too many tricky scenarios/bugs, because the ZK watch is one-time (which can result in missed state transitions) and the notification/processing is asynchronous (which can lead to delayed/out-of-date state in master memory). And by replacing ZK with the meta table, we also need to discard this 'RS updates - master listens' pattern, since the meta table inherently lacks a listen-notify mechanism.

          I think ZK got a bad reputation not on its own merit, but on how we use it.

          I can see that problems exist but IMHO advantages outweigh the disadvantages compared to system table.
          Co-located system table, I am not so sure, but so far there's not even a high-level design for this (for example - do all splits have to go through the master/system table now? how does it recover? etc.).
          Perhaps we should abstract an async persistence mechanism sufficiently and then decide. Whether it would be ZK+notifications, or system table, or memory + wal, or colocated system table, or what.
          The problem is that the usage inside master of that interface would depend on perf characteristics.
          Anyway, we can work out the state transitions/concurrency/recovery without tying 100% to a particular store.
          Sergey Shelukhin Agree on "ZK got a bad reputation not on its own merit, but on how we use it.", especially if you mean that currently the master relies on ZK watches/notifications to maintain/update its in-memory region state. IMO this is almost the biggest root cause of the problems in the current assignment design. If we just use ZK the same way we would use the meta table for storing states, it makes no big difference whether the states are stored in ZK or in the meta table, right (except that using the meta table can have much better performance for the restart of a big cluster with a large number of regions)? But using ZK's update/listen pattern does make a difference.

          btw, any input on actor model?

          Things queue up operations/notifications ("ops") for the master; "AM" runs on a timer or when the queue is non-empty, taking as inputs the cluster state (incl. ongoing internal actions it ordered before, e.g. the OPENING state for a region) plus new ops from the queue, on a single thread; it generates new actions (not physically doing anything, e.g. talking to an RS); the ops state and cluster state are persisted; then the actions are executed on different threads (e.g. messages sent to RSes, etc.), and "AM" runs again, or sleeps for some time if the ops queue is empty.
          That is a different model, not sure if it scales for large clusters.
          Sergey Shelukhin "operations/notifications" means RS responses action progress to master? Master is the single point to update the state "truth"(to meta table) and RS doesn't know where the states are stored and doesn't access them directly, right? I think a communication/storage diagram can help a lot for an overall clear understanding here

          Hide
          Honghua Feng added a comment -

          Since HBASE-9726 was closed as a duplicate of this one, I copied the proposal from HBASE-9726 here for discussion/reference:

          The current assignment process (and also the split process) relies on ZK for the communication between the master and the regionservers. This pattern has two drawbacks:
          1. For a cluster with a big number of regions (say, 10K-100K regions), ZK becomes the bottleneck for cluster restart, since the assignment/split status/progress is stored in ZK and ZK has limited write throughput
          2. Since ZK's watch is one-time and the event notification/processing is asynchronous, there is no guarantee that the master (the watcher) is notified of the up-to-date status/progress in time; thereby the master relies on idempotence for its correctness, which makes the logic/code very hard to understand/maintain

          A new assignment design proposal is as below:
          1. Assignment/split status/progress is stored in a system table (say 'assignTable'), similar to the meta table, rather than in ZK, to improve the write throughput and hence the performance of restart for a cluster with a large number of regions.
          2. The communication pattern for assignment/split is changed this way: the master talks directly with the regionserver (the master issues the assign request to the regionserver, the regionserver reports the assignment progress back to the master) and records the status/progress of each assignment/split in the 'assignTable'; in case of master failure, the new active master reads the 'assignTable' to rebuild its knowledge of the ongoing assignment/split tasks and continues from that knowledge. (The regionserver doesn't write to the 'assignTable'.)
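
          A rough sketch of how point 2 could look on the master side, assuming a hypothetical 'assignTable' abstraction and a direct master-to-RS RPC (all names invented for illustration, not actual HBase code):

            import java.io.IOException;

            // Sketch of the proposed pattern: the master drives the transition, talks to the RS
            // directly, and records progress in an 'assignTable'-like system table. On failover
            // the new master scans that table and resumes in-flight assignments.
            class AssignTableDriver {
              interface AssignTable {
                final class Entry { public String region; public String state; }
                void put(String region, String state) throws IOException;
                Iterable<Entry> scan() throws IOException;
              }
              interface RegionServerRpc { void openRegion(String server, String region) throws IOException; }

              private final AssignTable assignTable;   // system table keyed by region, value = assignment state
              private final RegionServerRpc rsRpc;     // direct master -> RS RPC, no ZK in the path

              AssignTableDriver(AssignTable assignTable, RegionServerRpc rsRpc) {
                this.assignTable = assignTable;
                this.rsRpc = rsRpc;
              }

              void assign(String region, String server) throws IOException {
                assignTable.put(region, "PENDING_OPEN on " + server);   // record intent first
                rsRpc.openRegion(server, region);                       // ask the RS directly
                // The RS reports progress back via master RPC; only the master writes to assignTable.
              }

              void onRegionOpened(String region, String server) throws IOException {
                assignTable.put(region, "OPEN on " + server);           // progress recorded by the master
              }

              // New active master: rebuild knowledge of ongoing assignments and continue from it.
              void recover() throws IOException {
                for (AssignTable.Entry e : assignTable.scan()) {
                  if (e.state.startsWith("PENDING_OPEN")) {
                    // re-drive the assignment; the operation must be idempotent on the RS side
                  }
                }
              }
            }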

          Hide
          Sergey Shelukhin added a comment -

          I think it's the approach discussed above.

          I will update the doc on Monday; I think I'm sold on a collocated system table.
          Initially we can just run an RS that runs the master library and only hosts "hardcoded" system regions, as the master.
          Then probably any RS (with caveats) can host the master regions and act as master, so recovery can become much easier.

          Hide
          Aaron T. Myers added a comment -

          ZK is used for the selection of the primary NN (via the failover controllers) but I believe the journal nodes (that do the durable consensus logging) do not use ZK at all. Todd Lipcon or Aaron T. Myers can confirm.

          I can confirm this. The QJM in the NN uses its own (heavily ZK-inspired) consensus protocol, but does not rely on ZK itself. The only thing HDFS currently uses ZK for is the leader election of the active NN, as Jon says here.

          Hide
          Jimmy Xiang added a comment -

          Honghua Feng, as to the uncertainty due to ZK, I don't think it is because of the way we use it. It is more because ZK doesn't support continuous events. You have to set the watch again after each event callback. The problem is that after an event is triggered, when we try to get the data, the data could have changed again, so an event is missed, which will cause a state jump.
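
          For reference, the one-time-watch gap being described looks roughly like this with the plain ZooKeeper client API (the znode path is made up for the example):

            import org.apache.zookeeper.KeeperException;
            import org.apache.zookeeper.WatchedEvent;
            import org.apache.zookeeper.Watcher;
            import org.apache.zookeeper.ZooKeeper;
            import org.apache.zookeeper.data.Stat;

            // Illustration of the one-time-watch gap: between the watch firing and the watch being
            // re-registered by the next getData(), the znode can change again and that intermediate
            // state is never observed.
            class RegionTransitionWatcher implements Watcher {
              private static final String ZNODE = "/hbase/region-in-transition/region-1"; // made-up path
              private final ZooKeeper zk;

              RegionTransitionWatcher(ZooKeeper zk) { this.zk = zk; }

              void start() throws KeeperException, InterruptedException {
                zk.getData(ZNODE, this, new Stat());       // sets a one-time watch
              }

              @Override
              public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                  try {
                    // Re-read and re-set the watch. If the RS wrote OPENING and then OPENED before this
                    // call runs, we only observe OPENED: the OPENING transition is "jumped over".
                    byte[] data = zk.getData(ZNODE, this, new Stat());
                    handleState(new String(data));
                  } catch (KeeperException | InterruptedException e) {
                    // error handling / re-watching omitted in this sketch
                  }
                }
              }

              private void handleState(String state) { /* update the master's in-memory view */ }
            }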

          Currently, we do have a region state machine. However, the machine is not strict due to the ZK thing. We could jump over some states, which means the state transition machine can't be strictly enforced. If we go without ZK, we can have a strict state machine to follow. That will make things much more predictable.

          Sergey Shelukhin, as to the janitor, I think we don't need it. Currently, we have a timeout monitor. But it is disabled and will be removed soon, I think. Without the monitor, ITBLL with CM runs very well. With the 0.96 tip, I tried to run ITBLL with CM with aggressive region moving, and it is perfectly fine. If an RS is gone, SSH should handle it properly and assign its regions. If there is a janitor, it will compete with SSH in this case, which probably does more harm than good.

          To make some RS serve the role of master, besides having meta on it, we can have some (not all, of course, to make Jesse Yates happy) system tables on it too. This way, we can support tiered region assignment, i.e. we can open some regions before the rest if those regions can be assigned to the master RS, or we can open them on this master RS at first and then move them away later after the system is fully started. This applies to some special regions only, for sure.

          Now, we bundle two important modules (master + meta) in one RS. It is critical to make sure it has a light load and does not die too often (even better, does not die at all). So I think we should move other regions out of this RS once it's promoted to be the master one.

          I think we should allow only a list of RSes with good hardware to be the master, if not all RS nodes have decent/identical hardware.

          Hide
          Honghua Feng added a comment -

          Jimmy Xiang

          as to the uncertainty due to ZK, I don't think it is because of the way we use it. It is more because ZK doesn't support continuous events. You have to set the watch again after each event callback. The problem is that after an event is triggered, when we try to get the data, the data could have changed again, so an event is missed, which will cause a state jump.

          Agree. 'One-time watch' and 'asynchronous event notification' are the root causes of the current AM problems (I mentioned this in my comment above). And when I said 'because of the way we use it', I meant that we use ZK's watch/event mechanism: process A (RS) updates ZK, and process B (master) gets notified of the update via a watch event. If we use ZK just as a reliable storage, the same way we would use the meta table, it makes no difference whether we use the meta table or ZK (except for the performance difference)
          In the theme of using the meta table, we adopt another communication pattern for the tasks (assign/split/merge): the master requests the RS to do something (and the master stores the task progress/state in the meta table), the RS reports its progress to the master periodically, and the master updates the task progress in both memory and the meta table... Under this theme we could use ZK to replace the meta table and avoid the previous state-transition-miss problem as well, since we wouldn't be using ZK's watch/event mechanism, just using it as a reliable storage. Right?
          Just to clarify, I think we share the same understanding of this problem; you can check my comments above.

          Hide
          Jimmy Xiang added a comment -

          Good. I think we are on the same page.

          just using it as a reliable storage.

          We probably won't use ZK as a pure storage. Meta table + cache is a good alternative.

          Hide
          Sergey Shelukhin added a comment -

          Jimmy Xiang by janitor I mean not the timeout monitor, but something picking up timeouts of non-master ops like open.
          It's a rare case and probably never happens in the int tests, but there can be a case where an RS takes too long to open a region.

          Hide
          Jimmy Xiang added a comment -

          I see.

          Hide
          Sergey Shelukhin added a comment -

          Ok, it's harder than I thought, I don't think I will be done today... but I think I have a clear picture now that covers the above feedback, so I am trying to cover all the failover scenarios and state conflicts.

          Hide
          Eric Newton added a comment -

          Accumulo does manage tablet (region) assignment tracking through the metadata table, and further, uses a distributed state machine to scale up a little beyond a single master node. I have been meaning to write it up, but I've not had a chance.

          I've not kept up with every HBase improvement, so I don't know if it is pertinent... the accumulo metadata table is typically spread out over 50 - 100% of the available tablet (region) servers.

          Still, the metadata table, and especially the root table(t), is subject to hot-spotting on large map/reduce jobs where hundreds (or thousands) of clients are learning tablet locations at the same time. Block caching is important, but at some point massive numbers of simultaneous RPC requests to a single node cause delays, or even timeouts and failures.

          But using accumulo to store accumulo state has scaled well.

          Accumulo has 2 frameworks for master tasks:

          • master general state processing: a table should be online, assignments are recorded and servers repeatedly informed
          • FATE processing, where multi-stage operations are saved, executed and progress is re-recorded

          The first is general maintenance: keeping the system running. Tablets are assigned, unassigned, and in general balanced.

          The second allows for temporal deviance: tablets are taken offline for a merge, for example. The step-by-step allocation of resources and state are walked, each step recording progress in zookeeper.
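
          To make the second framework concrete, a multi-step FATE-style operation boils down to roughly the following sketch (simplified and loosely modeled on the idea of Accumulo's step/repo interface, not a verbatim copy of its API):

            import java.io.Serializable;

            // Simplified FATE-style step: each step is idempotent, can say whether it is ready to run,
            // does one unit of work, and returns the next step (or null when the operation is done).
            // Which step a transaction is on gets persisted (e.g. in ZooKeeper) after each call.
            interface Step<T> extends Serializable {
              long isReady(long txId, T env);                    // 0 = run now, otherwise "retry after this many ms"
              Step<T> call(long txId, T env) throws Exception;   // do the work, return the next step
              void undo(long txId, T env) throws Exception;      // roll back if a later step fails
            }

            // Example: an offline-for-merge operation walked step by step, progress recorded between steps.
            class TakeTabletsOffline implements Step<Object> {
              public long isReady(long txId, Object env) { return 0; }
              public Step<Object> call(long txId, Object env) {
                // ... unassign the tablets in the range ...
                return new DoMerge();                            // next step; the runner persists this before executing it
              }
              public void undo(long txId, Object env) { /* re-assign the tablets */ }
            }

            class DoMerge implements Step<Object> {
              public long isReady(long txId, Object env) { return 0; }
              public Step<Object> call(long txId, Object env) {
                // ... merge the metadata entries ...
                return null;                                     // done
              }
              public void undo(long txId, Object env) { }
            }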

          Hide
          stack added a comment -

          Thanks for the helpful input Eric Newton

          Hide
          Jonathan Hsieh added a comment - edited

          FYI, I've been looking at our support cases, and have been thinking about and writing up a clean-slate design for a master redesign with the problems we've faced in the field in mind. I focus a bit more on the invariants necessary in the different states, state transitions with master interactions, extensibility of the model, and on the recovery strategy. It basically takes a pessimistic view of the world; if I had to summarize its spirit, I'd call it the "hbck-all-the-time" master.

          It is currently durable-storage agnostic but requires atomic CAS operations (a single row or a single znode should be sufficient). When I re-read this thread, it could use either of the implementation approaches described here (zk vs meta, etc). It sounds like being based in hbase is preferred, so a little more thought is going in that direction. I'm currently working on examples of how to extend it for new features (like fast write recovery aka distributed log replay) and proving to myself that it would be immune to problems we've encountered before like double assignments, conflicting concurrent operations (especially during recovery), and regions stuck in transition in the face of failures, hangs or juliet pauses.

          I read Sergey's doc after my first cut and while there are some similarities it deviates in other places. (I definitely want more on the error recovery and error prevention mechanics.) My hope is to share it sometime later this week so that folks can read, discuss and compare the different designs presented at the upcoming dev meeting. Before any jiras are filed for implementation the design should also consider things like upgrades, compatibility and performance.

          I'm also hoping I'll have time to take a look at the accumulo master's design as well for the discussion.

          Hide
          Sergey Shelukhin added a comment -

          Jonathan Hsieh I am writing out very detailed operation and failover descriptions right now

          Hide
          Jonathan Hsieh added a comment -

          Sergey Shelukhin Looking forward to it!

          Hide
          Enis Soztutar added a comment -

          I also started a document some time ago, but never got to finish it to the level of detail I would like. However, I think we can agree on the design goals section, which I augmented from the discussion so far:

          • Robust implementation
          • Comprehensive test coverage by mocking server and region assignment states (unit testable without MiniCluster and CM stuff)
          • Bulk region operations
          • Region operations should be isolated from server operations (AM vs SSH, log splitting), and table operations (disabling / disabled table, schema changes, etc) and cluster shutdown. AM and SSH should NEVER know about table state (disable/disabling). Server liveness checks can only be done as an optimization (servers can fail after the check is done)
          • There should be one source of truth
          • Should be compatible with master failover, and concurrent region operations(split, RS failover, balancer, etc)
          • AM should guarantee that a region can be hosted by a single region server at any given time
          • AM should be understandable by simple human beings like myself
          • Actions for AM should be logged (possibly separately). We would like to be able to construct the history for the regions from logs or some persisted state.
          • Assignment should be performant and parallelizable. We should target handling millions of regions and thousands of servers. A single region assignment should complete under 1 sec. (1PB data with 1 GB regions = 1M regions)
          • No master abort when a region’s state cannot be determined. This results in support cases where the master cannot start, and without a master things become even worse. We should “quarantine” the regions if absolutely needed.
          Hide
          Sergey Shelukhin added a comment -

          Attaching a preliminary version... I still need to iron out all the operation details; from the point where the step-by-step operation descriptions start, many things may be wrong, but it should be good up until then.

          Sorry, the Mac craps out when printing this as a PDF due to the mix of page orientations; let me try to find a better format for the final version...

          Hide
          stack added a comment -

          @enis The list makes for a pretty good set of requirements. We used to talk about 100k regions but folks are long past that now, so we are behind the curve (Flurry are >250k IIRC), and we may want to tend away from a few large regions and more toward many small regions if we can get the AM to perform (advantages: smaller compression runs, easier to free up WALs, etc.)

          Hide
          Nicolas Liochon added a comment -

          +1 for Enis' requirements list.
          I tend to think that AM and meta should be collocated.

          Hide
          Nick Dimiduk added a comment -

          I'm also a fan of Enis's list, particularly "AM should be understandable by simple human beings like myself."

          The observation I'll add here is that AM and meta don't necessarily need to be collocated. What is necessary is that AM maintain a strongly consistent view of the world, at least from what I understand about the current design. That requirement can be relaxed iff there's an explicitly distributed state management system. Such a system is probably composed out of idempotent operations over CRDTs.

          I also question the wisdom of moving away from ZK for management of active cluster state, primarily because in our current architecture, that component is completely out of band of data operations. Meaning, the activities which put stress on the configuration consensus bits are different from the operations that put stress on a data provider. (Yes, data activity results in region relocation, but that's a maintenance task, not direct involvement.) Moving to dependency on collocation unnecessarily conflates those two aspects of the system.

          If the issues with Zookeeper originate from implementation details, why not fix implementation rather than look to a new architecture? For instance, the CoreOS folk have a little something called etcd. Raft specifically may not provide the correct kind of available consensus we need; the idea is to examine both the baby and the bathwater.

          Hide
          Nicolas Liochon added a comment -

          AM and meta don't necessarily need to be collocated

          If they are separated, you double the failure probability, as you need both the AM and .META. to work. Moreover, speaking to .meta. becomes a distributed problem, while it's less the case when they are collocated (only less because of HDFS).

          moving away from ZK for management

          I believe we will need it to determine who is the AM lead. I don't really know about storing in zookeeper vs. meta. As Jimmy said, using zookeeper to do RPC calls seems wrong, however.

          I guess this can be decided later. For the requirements, I don't have anything to add to Enis' list.

          Hide
          ramkrishna.s.vasudevan added a comment -

          Started going through this document. From my experience with the AM, the number of states we have and the dependency on ZK callbacks definitely make things a bit difficult to understand and track, and the state of truth is spread across multiple places.
          In the doc, for the create table scenario there are cases where a create table failure on master abort will result in a table that has fewer regions than the client actually specified in the splits.
          The master failover part is another critical area, as in how we collect the alive and dead RS lists and the list of regions that were partially in opening/closing or splitting. It is this failure condition where we end up in a lot of hidden areas.
          Will read the document and share ideas, if any.

          Hide
          ramkrishna.s.vasudevan added a comment -

          HBASE-5583 is one such JIRA that handles create table failure cases.

          Hide
          Sergey Shelukhin added a comment -

          I don't think it can happen on create. Until all regions are moved to the Closed state after being created (atomically via a multi-row tx), the table won't leave the Creating state. If there's a failover, all regions are erased and created from scratch. Create table is rare enough for that to work.

          Enis Soztutar Wrt req list, mostly agree, however:

          Bulk region operations

          Can you please elaborate? Is it the same as modifying several regions' state under a multi-row lock?

          Region operations should be isolated from [snip] table operations (disabling / disabled table, schema changes, etc) and cluster shutdown. AM [snip] should NEVER know about table state (disable/disabling).

          Strongly disagree with this. If we are doing a bunch of balancing and the user disables a table at the same time, we have to handle it.
          If the user tries to force-assign regions of a table that is halfway through create, we have to handle this.
          For alter, we need to reopen regions, which will have to work w/splits and merges (it's covered in my doc).
          For what purpose do you want to isolate them?
          AM should not know about details e.g. schema logic, but it should know about logistics.

          No master abort when a region’s state cannot be determined. This results in support cases where master cannot start, and without master things become even worse. We should “quarantine” the regions if needed absolutely.

          That is dangerous. IIRC in my spec I only put master abort if somebody changes table state under the master; but in general, if a region is in an unknown state it's better to make the admin act than to just silently "disappear" part of the data - that can lead to wrong results.
          Perhaps the table needs to be quarantined then.

          Sergey Shelukhin added a comment -

          One more update from discussion here:

          • we currently have many operations that cannot be monitored other than by side effects (create table), or at all. We need a good way for the user to wait for operations. Given that we send the request to the master, and many operations can recover from master failure, we cannot use a simple async API with a request and an async response (at least not at the lowest level - a client library can hide master failover and provide that API). The lowest-level master API should involve some sort of persistent operation cookie, so that you can still wait for an operation after failover.
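          As an illustration only (these interfaces and names are hypothetical, not an existing HBase API), such a lowest-level API could hand back a durable cookie that the client can use to re-attach to the operation after a master failover:

          import java.util.concurrent.TimeUnit;

          /** Hypothetical lowest-level master API with a persistent operation cookie. */
          interface MasterOperations {
            /** Durable identifier persisted with the operation itself, so it survives master failover. */
            final class OperationCookie {
              final String id;
              OperationCookie(String id) { this.id = id; }
            }

            enum OperationStatus { PENDING, RUNNING, SUCCEEDED, FAILED }

            /** Submit asynchronously; returns once the request is durably recorded. */
            OperationCookie submitCreateTable(String tableName, byte[][] splitKeys);

            /** Poll by cookie; works against whichever master is currently active. */
            OperationStatus getStatus(OperationCookie cookie);

            /** Convenience blocking wait built on top of polling. */
            default boolean waitFor(OperationCookie cookie, long timeout, TimeUnit unit)
                throws InterruptedException {
              long deadline = System.nanoTime() + unit.toNanos(timeout);
              while (System.nanoTime() < deadline) {
                OperationStatus s = getStatus(cookie);
                if (s == OperationStatus.SUCCEEDED) return true;
                if (s == OperationStatus.FAILED) return false;
                Thread.sleep(200);
              }
              return false;
            }
          }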
          stack added a comment -

          Are we conflating functionality here (going by last comment above by Sergey)? There is AM and then there is another facility that uses AM to run sequences of steps to achieve an end (e.g. enable table)? Or is the notion that a revamped AM would do all? The long-running (enable a table w/ 1M regions) and short-term (assign region)? If it is to do both, I suggest we call the new facility GOD.

          Sergey Shelukhin added a comment - - edited

          You'd need some way to connect these "ends" with what AM is doing, so AM will have to support operations attached to its actions even if there's separate operation management.
          Moreover, you'd find out that these are not "steps"; they are state goals.
          For example, if you are disabling a table, you want to close its regions. So with a separate operation manager, you might create tasks to close all regions. But what if some server fails? Now some of your regions are already closed. The separate close-region operations might now fail, but the goal is achieved. If I start disabling a table and then kill all the RSes, the table is now disabled, but all the operations would fail.
          State goals fit much more naturally in AM than "steps". I want to avoid steps as much as possible.

          Stateful (as in, having separate state) multi-step operations are also hard to coordinate. In the above example, during recovery, you don't want to reopen a region if the table is disabling, but if the table disable is an external operation you don't know about it until the table is actually disabled.
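          A minimal sketch of the difference, with hypothetical names (Region, requestClose and reconcile are made up for illustration): driving toward a goal state is an idempotent reconcile loop, so a region that is already closed simply satisfies the goal instead of failing a "close" step:

          import java.util.List;

          /** Hypothetical sketch: disabling a table as a goal state rather than a list of steps. */
          final class DisableTableGoal {
            enum RegionState { OPENED, OPENING, CLOSING, CLOSED }

            interface Region {
              RegionState state();
              void requestClose();   // idempotent: no-op if already closing/closed
            }

            /**
             * Reconcile toward the goal "all regions closed". A region that is already
             * CLOSED (for example because its server died) simply satisfies the goal;
             * there is no separate "close region" step that can fail spuriously.
             */
            static boolean reconcile(List<Region> regions) {
              boolean done = true;
              for (Region r : regions) {
                switch (r.state()) {
                  case CLOSED:
                    break;                    // goal already met for this region
                  case OPENED:
                  case OPENING:
                    r.requestClose();         // drive toward the goal
                    done = false;
                    break;
                  case CLOSING:
                    done = false;             // in flight, wait for the state update
                    break;
                }
              }
              return done;                    // the table is disabled only when every region is CLOSED
            }
          }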

          Enis Soztutar added a comment -

          I think as a mental exercise to validate the new design, we should think about the cases for the following issues opened recently so that we can ensure that these classes of problems are eliminated:

          • HBASE-9724 Failed region split is not handled correctly by AM
          • HBASE-9721 meta assignment did not timeout
          • HBASE-9696 Master recovery ignores online merge znode
          • HBASE-9777 Two consecutive RS crashes could lead to their SSH stepping on each other's toes and cause master abort
          • HBASE-9773 Master aborted when hbck asked the master to assign a region that was already online
          • HBASE-9525 "Move" region right after a region split is dangerous
          • HBASE-9514 Prevent region from assigning before log splitting is done
          • HBASE-9480 Regions are unexpectedly made offline in certain failure conditions
          • HBASE-9387 Region could get lost during assignment

          Can you please elaborate? Is it the same as modifying several regions' state under multi-row lock?

          The bulk operations requirement is there so that we can do multiple operations in parallel - sending openRegions RPCs for multiple regions at the same time rather than doing one-by-one assignment. That is all.

          That is dangerous. IIRC in my spec I only put master abort if somebody changes table state under the master; but in general, if a region is in an unknown state it's better to make the admin act than to just silently "disappear" part of the data - that can lead to wrong results.

          Quarantining the table or region is fine, but the master should not be down because of this. For example, a region can fail to open, and you would want to track how many times the region has failed to open so that you can decide at some point that the region should go into a quarantined (or failed-open) state. I think there was some issue with a region bouncing from server to server indefinitely.
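          A minimal sketch of that idea, with made-up names and threshold (FailedOpenTracker is not an existing class): count failed opens per region and quarantine after a limit instead of aborting the master:

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          /** Hypothetical sketch: quarantine a region after repeated failed opens instead of aborting the master. */
          final class FailedOpenTracker {
            private final Map<String, Integer> failedOpens = new ConcurrentHashMap<>();
            private final int quarantineThreshold;

            FailedOpenTracker(int quarantineThreshold) {
              this.quarantineThreshold = quarantineThreshold;
            }

            /** Returns true when the region should be parked in a quarantined / failed-open state. */
            boolean recordFailedOpen(String regionName) {
              int failures = failedOpens.merge(regionName, 1, Integer::sum);
              return failures >= quarantineThreshold;
            }

            void recordSuccessfulOpen(String regionName) {
              failedOpens.remove(regionName);
            }
          }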

          For table operations intermixing with region operations, I'll have to read your updated doc.

          Sergey Shelukhin added a comment -

          Ok, I split the doc in half. That way it will be easier to read and manage.
          Part 1 is ready (as a current version) and describes the high-level design, operation semantics and interaction (I think the latter might be interesting for Jonathan Hsieh).
          It also tries to capture the requirement lists above and the high-level implementation (whatever is agreed upon to some degree).
          Please tell me if something is missing or wrong.

          For part 2 I will keep attaching updates. It covers the design of operations - state machines, exact steps, how the client tracks them, how recovery works, etc. It will follow part 1.

          Eric Newton added a comment -

          I'm sorry for asking such a basic question... could someone please comment: what does AM stand for?

          I did a quick search through the ticket and the attachments and it didn't pop out at me.

          Sergey Shelukhin added a comment -

          AssignmentManager, a class in the HBase master. Often, but not always, when talking about it people also mean a bunch of auxiliary classes around it, like ServerShutdownHandler, RegionClosed/OpenedHandler, ZKTable, etc., which together implement region assignment in HBase.

          Honghua Feng added a comment -

          I also question the wisdom of moving away from ZK for management of active cluster state...If the issues with Zookeeper originate from implementation details, why not fix implementation rather than look to a new architecture?

          Using a system table rather than ZK to store state info is for better (cluster restart) performance for a big cluster with, say, 250K regions. Certainly, if we change the way we use ZK (let the master be the single point that reads/writes ZK, without using ZK's watch/notify mechanism), there is no correctness/logic difference between using a system table and using ZK.

          Sergey Shelukhin added a comment -

          The doc hasn't been out for long; just clarifying - anyone interested in providing feedback for part 1?
          It'd be really nice to start working out implementation details in part 2 with some confidence, and/or writing code. Should I assume lack of interest or silent agreement to rewrite according to part 1?

          Jonathan Hsieh added a comment -

          I'm doing a pass, will provide feedback tomorrow.

          Devaraj Das added a comment -

          Quick comments:
          1. "Master knows of all external updates to the system store" - Are there such updates happening without master's knowledge
          2. I presume once the client is told an operation is accepted, it would be saved/queued somewhere so even if a different node picks up the master's duties, it can execute the operation. Related to that is that the master should be able to get back with the correct return code for the operation even in the case of fail-overs. Also, the "master" could have triggered some operations like shutdown handling that should be completed.
          3. I think we should support asynchronous operations (submit an operation and check periodically or something). There is no guarantee when a certain operation will complete especially when the operation requires co-ordination with other nodes and/or the node is falling behind in executing operations. We shouldn't force the model to be synchronous (we do not want to hold up precious node resources which we will in synchronous mode).
          4. Maybe, we should explicitly state handling cases where the "master" sends a region operation to a regionserver and the regionserver doesn't get back within some timeout, as one of the requirements. Fencing the regionserver etc are the possible actions when this happens.
          5. Should we fail-fast on the client side in case of conflicts? For example, a client issued "drop table" and this operation is in progress; another client comes in and says "create table" with the same name. We should allow clients to read the store without going through the "master" (a minimal sketch of such a check follows this list).
          6. Wondering whether we need to differentiate priorities/ordering/etc. for operations like "move region" initiated by the master/balancer versus initiated by the user. Who wins, etc. These operations are "advanced" and won't be commonplace but worth calling it out?
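          Regarding point 5 above, here is a hypothetical sketch (OperationStore and the other names are made up, not an HBase API) of a client-side fail-fast conflict check against a read-only view of the master's operation store:

          /** Hypothetical sketch: fail fast on the client when an operation conflicts with one already in flight. */
          final class ConflictCheck {
            enum OpType { CREATE_TABLE, DROP_TABLE, ALTER_TABLE }

            interface OperationStore {            // read-only view of the master's operation store
              /** Returns the in-flight operation on the table, or null if none. */
              OpType inFlightOperation(String tableName);
            }

            /** Reject a CREATE_TABLE locally if a conflicting operation on the same name has not finished yet. */
            static void checkCreateTable(OperationStore store, String tableName) {
              OpType pending = store.inFlightOperation(tableName);
              if (pending == OpType.DROP_TABLE || pending == OpType.CREATE_TABLE) {
                throw new IllegalStateException(
                    "Table " + tableName + " has a pending " + pending + " operation; retry later");
              }
            }
          }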

          Jonathan Hsieh added a comment - - edited

          Yesterday I shared with Sergey and some of the interested folks a draft of the design I've been working on (I'll call it the hbck-master) and a list of questions related to Sergey's design. Since Sergey's doc has master5 in its name, I'll refer to it as "master5". He's answered some questions in email, but we should do technical discussions out here. We'll be working together to hash out holes in each other's designs and potentially merge them.


          I have a lot of questions. I'll hit the big ones first. Also, would it be possible to put a version of this up as a gdoc so we can point out nits and places that need minor clarification? (I have a marked-up physical copy of the doc; it would be easier to provide feedback.)

          Main Concerns:

          What is a failure and how do you react to failures? I think the master5 design needs to spend more effort considering failure and recovery cases. I claim there are four types of responses from a networked IO operation: the two states we normally deal with - ack successful and ack failed (nack) - plus unknown due to a timeout that succeeded (timeout success) and unknown due to a timeout that failed (timeout failed). We have historically missed the last two timeout cases or assumed a timeout means a failure (nack). It seems that master5 makes the same assumption.
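          To spell the four cases out, here is a minimal sketch (names are illustrative, not from either design doc): the caller can only observe three outcomes, because the two timeout cases are indistinguishable from its side, so recovery has to be correct for both:

          /**
           * Hypothetical sketch of the four response types above. The caller can observe
           * only three of them: on a timeout it cannot tell whether the remote work
           * actually happened, so recovery must be correct for both possibilities.
           */
          final class RemoteCall {
            enum Outcome { ACK_SUCCESS, ACK_FAILURE, TIMEOUT }

            static void handleOpenRegion(Outcome outcome) {
              switch (outcome) {
                case ACK_SUCCESS:
                  // safe to record the region as opened
                  break;
                case ACK_FAILURE:
                  // safe to retry or reassign immediately
                  break;
                case TIMEOUT:
                  // the open may or may not have succeeded on the regionserver;
                  // do not reassign until the server is fenced or reports its state
                  break;
              }
            }
          }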

          I'm very concerned about what we need to do to invalidate cached RS information at clients in the case of a hang, and whether that will violate the isolation guarantees that we claim to provide. I really want an in-depth failure-handling case analysis of one slice, including a client with cached RS assignments, for move and for something more complicated such as split or alter.

          I really want more invariants specified for the FSM states. E.g., if a region is in state X, does it have a row in meta? Does it have data on the FS? Is it open on another region server? Is it open on only one region server? I think the 8 pages of tables at the back of the master5 doc could be more concise and precise, which would help us attempt to prove correctness.

          Clarification questions:

          1) State update coordination. What are "state updates from the outside"? Do RSs initiate splitting on their own? Maybe a picture would help so we can figure out if it is similar to or different from hbck-master's.

          2) Single point of truth. What is this truth - the actions the user specified? What the RSs are reporting? The last state we were confirmed to be at? hbck-master tries to define what single point of truth means by defining intended, current, and actual state data with durability properties on each kind. What do clients look at, and who modifies what?

          3) Table record: "if a region is out of date, it should be closed and reopened". It is not clear in master5 how regionservers find out that they are out of date. Moreover, how do clients talking to those RSs with stale versions know they are going to the correct RS, especially in the face of RS failures due to timeouts?

          4) Region record: transition states. Shouldn't they be defined as part of the region record? (This is really similar to hbck-master's current state and intended state.)

          5) Note on user operations: the forgetting thing is scary to me – in your move split example, what happens if an RS reads state that is forgotten?

          6) Table state machine. How do we guarantee clients are not writing against out-of-date region versions? (In hang situations, regions could be open in multiple places - the hung RS and the new RS the region was assigned to and successfully opened on.)

          7) Region state machine. The earlier draft had splitting and merge cases. Are they elided in master5, or are they not present any more? How would this get extended to handle Jeffrey's distributed log replay/fast write recovery feature?

          8) Logical interactions: it sounds like master5 allows concurrent operations on specific regions and on a specific table (e.g. it will allow moves, splits and merges on the same region). hbck-master (though not fully documented) only allows certain region transitions when the table is enabled or when it is disabled. Are we sure we don't get into race conditions? What happens if a disable gets issued - it's possible for someone to reopen the region and for old clients to continue writing to it even though it is closed?

          nit. 9) "in cursive" - do you mean "in italics"?

          10) The table operations section has tables which I believe describe the actions between FSM states in the table or region FSMs. Is this correct? Can the edges be labeled to describe which steps these transitions correspond to?

          Short doc:
          nit: Design Constraints, code should: Have AM logic isolated from the persistent storage of state.
          // I think this should be "abstracted" so we can plug in different implementations of persistent storage of state.

          Jonathan Hsieh added a comment -

          Long version of the "hbck-master" design. Note that is design level only and its goal is to try to convince us about correctness. It is long, but you can get the gist of it reading the first 4 pages. It gets more detailed about region state machines and transitions and failure handling after that (including some proof sketches, how extensions would work, things ignored thus far). I'm fairly certain there are still bugs in that.

          There are some ideas explored more deeply in hbck-master and there are others explored more in master5. I think there are also a few places where I need clarification to see if we are in the same place or different places.

          Sergey Shelukhin added a comment - - edited

          Answers lifted from the email as well (some fixes, plus one answer was modified due to the clarification here).

          What is a failure and how do you react to failures? I think the master5 design needs to spend more effort to considering failure and recovery cases. I claim there are 4 types of responses from a networked IO operation - two states we normally deal with ack successful, ack failed (nack) and unknown due to timeout that succeeded (timeout success) and unknown due to timeout that failed (timeout failed). We have historically missed the last two cases and they aren't considered in the master5 design.

          There are a few considerations. Let me examine whether there are other cases than these.
          I am assuming the collocated table, which should reduce such cases for state (probably, if the collocated table cannot be written reliably, the master must stop the world and fail over).
          When an RS contacts the master to do a state update, it errs on the side of caution - no state update, no open region (or split).
          Thus, except for the case of multiple masters running, we can always assume the RS didn't online the region if we don't know about it.
          Then, for messages to the RS, see "Note on messages"; they are idempotent so they can always be resent.
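          A minimal sketch of what "idempotent, can always be resent" means on the regionserver side (OpenRegionHandler and its states are hypothetical illustration, not the real RS code):

          /** Hypothetical sketch of an idempotent open-region message on the regionserver side. */
          final class OpenRegionHandler {
            enum LocalState { NOT_PRESENT, OPENING, OPEN }

            private LocalState state = LocalState.NOT_PRESENT;

            /**
             * Safe to deliver any number of times: a resent message for a region that is
             * already opening or open is simply acknowledged again rather than re-executed.
             */
            synchronized void onOpenRegion(String regionName) {
              switch (state) {
                case NOT_PRESENT:
                  state = LocalState.OPENING;
                  // kick off the actual open; report OPENED to the master when done
                  break;
                case OPENING:
                case OPEN:
                  // duplicate of a message already being processed; re-ack, do nothing
                  break;
              }
            }
          }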

          1) State update coordination. What is a "state updates from the outside" Do RS's initiate splitting on their own? Maybe a picture would help so we can figure out if it is similar or different from hbck-master's?

          Yes, these are RS messages. They are mentioned in some operation descriptions in part 2 - opening->opened, closing->closed; splitting, etc.

          2) Single point of truth. hbck-master tries to define what single point of truth means by defining intended, current, and actual state data with durability properties on each kind. What do clients look at who modifies what?

          Sorry, I don't understand the question. By single source of truth I mainly mean what is going on with the region; it is described in the design considerations.
          I like the idea of "intended state"; however, without more detailed reading I am not sure how it works for multiple ops, e.g. the master recovering a region while the user intends to split it, so the split should be executed after it's opened.

          3) Table record: "if a region is out of date, it should be closed and reopened". It is not clear in master5 how regionservers find out that they are out of date. Moreover, how do clients talking to those RSs with stale versions know they are going to the correct RS, especially in the face of RS failures due to timeouts?

          On alter (and on startup if it failed), the master tries to reopen all regions that are out of date.
          Regions that are not opened will either pick up the new version when they are opened, or (e.g. if they are currently Opening with the old version) the master discovers they are out of date when the RS transitions them to Opened, and reopens them again.

          As with any case of alter on an enabled table, there are no guarantees for clients.
          To provide those without disable/enable (or the logical equivalent of coordinating all closes and opens), one would need some form of version time-travel, or waiting for versions, or both.
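          A minimal sketch of the out-of-date check described above (TableRecord, RegionRecord and the version fields are hypothetical): when the RS reports Opened, the master compares the schema version the region was opened with against the table's current version, and schedules a reopen if it is behind:

          /** Hypothetical sketch: detect an out-of-date region when the RS reports it as opened. */
          final class SchemaVersionCheck {
            interface TableRecord { long currentSchemaVersion(); }
            interface RegionRecord { long openedWithSchemaVersion(); }

            /** True if the region must be closed and reopened to pick up a newer schema. */
            static boolean needsReopen(TableRecord table, RegionRecord region) {
              return region.openedWithSchemaVersion() < table.currentSchemaVersion();
            }
          }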

          4) Region record: transition states. This is really similar to hbck-master's current state and intended state. Shouldn't they be defined as part of the region record?

          I mention somewhere that this could be done. One thing is that if several paths are possible between states, it's useful to know which one is taken.
          But do note that I store user intent separately from what is currently going on, so they are not exactly similar as far as I can see.

          5) Note on user operations: the forgetting thing is scary to me – in your move split example, what happens if an RS reads state that is forgotten?

          I think my description of this might be too vague. State is not forgotten; previous intent is forgotten. I.e., if a user issues several conflicting operations in order (e.g. split and then merge), the first one will be canceled (safely).
          Also, the RS does not read state as a guideline for what needs to be done.

          6) table state machine. how do we guarantee clients are writing against the correct version in the face of failures?

          Can you please elaborate?

          7) region state machine. The earlier draft had splitting and merge cases. Are they elided in master5 or not present any more? How would this get extended to handle Jeffrey's distributed log replay/fast write recovery feature?

          As I mention somewhere, these could be separate states. I was kind of afraid of blowing up the state machine too much, and I noticed that for split/merge you store siblings/children anyway, so you can recognize them, and for most purposes the different split/merge states are the same as Opened and Closed.

          I will add those back; it would make sense.
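          A minimal sketch of adding those states back (the enum and method are illustrative, not the actual RegionState class): explicit Splitting/Merging states that, for most purposes, behave like Opened:

          /** Hypothetical sketch of a region state machine that keeps explicit split/merge states. */
          enum RegionTransitionState {
            OPENING,
            OPENED,
            CLOSING,
            CLOSED,
            SPLITTING,  // parent still serving; daughter regions are recorded on the region record
            MERGING;    // region still serving; the merge sibling is recorded on the region record

            /** For most purposes SPLITTING and MERGING can be treated like OPENED. */
            boolean effectivelyOpen() {
              return this == OPENED || this == SPLITTING || this == MERGING;
            }
          }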

          8) logical interactions: sounds like master5 allows concurrent region and table operations. hbck-master (though not fully documented) only allows certain region transitions when the table is enabled or when it is disabled. Are we sure we don't get into race conditions? What happens if disable gets issued - it's possible for someone to reopen the region and for old clients to continue writing to it even though it is closed?

          Yes, parallelism is intended. You can never be sure you have no races, but we should aim for it.

          master5 is missing the disabled/enabled check; that is a mistake.

          The Part 1 operation interactions already cover it:

          table disable doesn't ack until all regions are closed (master5 is wrong there).
          region opening cannot start if table is already disabling or disabled.
          if region is already opening when disable is issued, opening will be opportunistically canceled.
          if disable fails to cancel the opening, or the server opens it first in a race, the region will be opened, and the master will issue a close immediately after the state update. Given that the region is not closed, the disable is not complete.
          if opening (or closing) times out, the master will fence off the RS and mark the region as closed. If there were some way of fencing the region separately (a ZK lease?), it would be possible to use that.

          In any case, unless the client checks table state before every write, there's no easy way to prevent writes on a disabling table. Writes to a disabled table will not be possible.

          On ensuring there's no double assignment due to an RS hanging:
          The intent is to fence the WAL for the region server, the way we do now. One could also use another mechanism.
          Perhaps I could specify it more clearly; I think the problem of making sure the RS is dead is nearly orthogonal.
          In my model, due to how an opening region is committed to opened, we can only be unsure when the region is in the Opened state (or similar states such as Splitting, which are not present in my current version but will be added).
          In that case, in the absence of a normal transition, we cannot do literally anything with the region unless we are sufficiently sure that the RS is sufficiently dead (e.g. cannot write).
          So, while we ensure that the RS is dead, we don't reassign.
          My document implies (but doesn't elaborate; I'll fix that) that the master does the direct Opened->Closed transition only when that is true.
          A state called "MaybeOpened" could be added. Let me add it...

          stack added a comment -

          Suggest moving this out to the dev mailing list, as per the bible quoted below. Start a thread there?

          From "Producing Open Source Software":

          "Make sure the bug tracker doesn't turn into a discussion forum.
          Although it is important to maintain a human presence in the bug
          tracker, it is not fundamentally suited to real-time discussion. Think
          of it rather as an archiver, a way to organize facts and references
          to other discussions, primarily those that take place on mailing lists.

          "There are two reasons to make this distinction. First, the bug
          tracker is more cumbersome to use than the mailing lists (or than
          real-time chat forums, for that matter). This is not because bug
          trackers have bad user interface design, it's just that their interfaces
          were designed for capturing and presenting discrete states, not
          free-flowing discussions. Second, not everyone who should be
          involved in discussing a given issue is necessarily watching the bug
          tracker. Part of good issue management...is to make sure each issue
          is brought to the right peoples' attention, rather than requiring every
          developer to monitor all issues. In the section called “No
          Conversations in the Bug Tracker” in
          Chapter 6, Communications, we'll look at ways to make sure people
          don't accidentally siphon discussions out of appropriate forums
          and into the bug tracker."

          Pg. 50 of http://producingoss.com/en/producingoss.pdf

          Sergey Shelukhin added a comment -

          We need some convention for inline responses in the mailing list (or tell me if there's one)

          Jeffrey Zhong added a comment - - edited

          There are already two good write-ups on the topic. Here is yet another one: https://issues.apache.org/jira/secure/attachment/12609244/Is%20the%20FATE%20of%20Assignment%20Manager%20FATE.pdf

          My motivation for posting this small draft is that I think we need a systematic model for things like table operations, region assignment and others in the future, so that we can design them in a unified way that is easy for people to follow, easy to add test capabilities to, and easy to reason about.

          I think FATE provides us those capabilities. We can view FATE as a design/system model rather than a cold Accumulo implementation (in the sense that we can change the implementation to fit HBase's cases).
          Under this model, we can simplify region assignment and force feature implementers to code in such a way that a partially failed operation can be resumed/retried, leaving execution to the framework.

          The draft is only two pages long (if you know FATE it should only be one page), and I hope we don't drop FATE as a design option too quickly.
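          For readers who don't know FATE, here is a rough sketch of the contract the draft argues for, modeled loosely on Accumulo's Repo interface; the names (MasterOpStep, MasterOpStore) are hypothetical, not an existing HBase or Accumulo API:

          /**
           * Hypothetical sketch of a FATE-style framework contract: each step is
           * idempotent, is persisted before it runs, and returns the next step
           * (or null when done), so a new master can resume a half-finished
           * operation from the store.
           */
          interface MasterOpStep<ENV> {
            /** May be called again after a crash; must clean up partial work and continue. */
            MasterOpStep<ENV> run(long txid, ENV env) throws Exception;

            /** Undo whatever this step did, for operations that must roll back on failure. */
            void rollback(long txid, ENV env) throws Exception;
          }

          interface MasterOpStore<ENV> {
            long create(MasterOpStep<ENV> firstStep);      // durably seed a new operation
            MasterOpStep<ENV> top(long txid);              // current step, survives master failover
            void push(long txid, MasterOpStep<ENV> step);  // record the next step before running it
            void pop(long txid);                           // step finished
          }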

          Jonathan Hsieh added a comment -

          Attached a slightly revised version that fixes some typos and adds a paragraph to the core design.

          Jonathan Hsieh added a comment -

          Stack acked.

          Let's post design docs here, and move discussions comparing them to the mailing list.

          Sergey Shelukhin Let's prefix thread names with [hbase-5487] in the subject, maybe rename subject lines if we get into a more focused discussion that warrants its own thread (if one part gets long), and in general reply inline. (I found this interesting: http://en.wikipedia.org/wiki/Posting_style.)

          I'll start by copying and pasting unresolved parts of the response-reply above to the dev mailing list.

          Sergey Shelukhin added a comment -

          Updating the part 1 doc based on Jonathan Hsieh's feedback (the parts that are in the scope of part 1).
          Main changes: added details about "single source of truth"/"split-brain", as well as some "out of scope" material about RS fencing, HA, etc., with references to possible solutions.
          Also some minor changes.
          Let me try to update part 2 for tomorrow...

          Sergey Shelukhin added a comment -

          Ah well, I never got to part 2. Did you guys make progress on this? I may have time to resurrect this again soon.

          Jonathan Hsieh added a comment -

          Matteo and Aleks bring up an interesting case that any new master design should handle: HBASE-10136.

          Sergey Shelukhin added a comment -

          That's an interesting one. Given that snapshots by default have no guarantees wrt consistent writes between regions (or do they?), it seems like the snapshot should get the latest schema in case of a concurrent alter. Is there any consideration (other than the arguably implementation-level issue of not recovering from close-open) that would prevent that? For consistent snapshots, presumably the schema can be snapshotted first; I am assuming they don't stop the world and just take a seqId/mvcc/ts or something, so the newer values with the new schema will simply "not exist".

          Jonathan Hsieh added a comment -

          The problem isn't that you would get snapshots with inconsistent schemas if the two operations were issued concurrently. It is that open is async and outside the table write lock, which means the snapshot could fail because the region may not have been opened yet.

          This is a particular case where we would want the open routines to act synchronously with table alters and split daughter region opens (both open before the table lock is released and the snapshot can happen).

          Sergey Shelukhin added a comment -

          IMHO this, in the case of opens, pushes us toward not being fault tolerant. In large clusters you cannot get around servers failing and regions closing and reopening; a snapshot should just be able to ride over that. Splits are more interesting.
          Esp. if snapshots get used more (MR over snapshots), it may not be viable to prevent splits and other operations for the duration of every snapshot, alter, ...

          Andrew Purtell added a comment -

          MR over snapshots is already a terrible idea from a security perspective.

          Sergey Shelukhin added a comment -

          Yet, it's a very good idea from a perf perspective, and logical given that many large MR jobs don't need realtime data. Snapshots can still be secured, and table-level granularity is sufficient for most cases, I'd suspect.
          Regardless, it was just an example here.
          MR over snapshots can be discussed in HBASE-8369.
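          For context, a rough sketch of what such a job looks like, assuming the TableSnapshotInputFormat work from HBASE-8369; the snapshot name, restore directory and trivial row-counting mapper are placeholders.

            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hbase.HBaseConfiguration;
            import org.apache.hadoop.hbase.client.Result;
            import org.apache.hadoop.hbase.client.Scan;
            import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
            import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
            import org.apache.hadoop.hbase.mapreduce.TableMapper;
            import org.apache.hadoop.io.NullWritable;
            import org.apache.hadoop.mapreduce.Job;
            import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

            public class SnapshotScanJob {
              // Trivial mapper: a real job would emit something useful per row.
              static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
                @Override
                protected void map(ImmutableBytesWritable row, Result value, Context context) {
                  context.getCounter("snapshot", "rows").increment(1);
                }
              }

              public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(HBaseConfiguration.create(), "scan-over-snapshot");
                job.setJarByClass(SnapshotScanJob.class);
                job.setNumReduceTasks(0);
                job.setOutputFormatClass(NullOutputFormat.class);
                // "mySnapshot" and the restore dir are placeholders.
                TableMapReduceUtil.initTableSnapshotMapperJob(
                    "mySnapshot", new Scan(), RowCountMapper.class,
                    NullWritable.class, NullWritable.class, job,
                    true, new Path("/tmp/snapshot-restore"));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
              }
            }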

          Andrew Purtell added a comment -

          "Snapshots can still be secured"

          This is debatable, and that is my point in bringing it up here. All of the enterprise customers I interact with universally want more than table-level granularity, which is why we spent so much time on cell-granularity features recently - all of which are totally defeated by MR over snapshots.

          Bringing up MR over snapshots as technical justification for other arguments needs the qualification that MR over snapshots itself may have limited applicability.

          Sergey Shelukhin added a comment -

          There are other justifications... my point is that having a lock over distributed, lengthy operations on tables, esp. with a region-level component also blocking table-level ops, is the king of all epic locks, and can cause lots of problems, esp. in large clusters. Snapshot is just one example.

          Jonathan Hsieh added a comment -

          Having snapshots succeed while splits, merges and alters are in flight can be handled with the open synchronization point. Having snapshots succeed through failovers would require some major revamping. We can file that issue – roughly, it would mean coordinating based on region name instead of region server name (non-trivial work).
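          A toy illustration of that direction, with purely hypothetical names (this is not existing HBase code): track snapshot progress per encoded region name rather than per server, so a region that moves during a failover can still be completed by its new host instead of failing the whole snapshot.

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            public class RegionKeyedSnapshotTracker {
              enum State { PENDING, DONE }

              // Keyed by encoded region name, not by the server currently hosting the region.
              private final Map<String, State> progressByRegion = new ConcurrentHashMap<>();

              void start(Iterable<String> encodedRegionNames) {
                for (String region : encodedRegionNames) {
                  progressByRegion.put(region, State.PENDING);
                }
              }

              // Whichever server currently hosts the region may report completion.
              void markDone(String encodedRegionName) {
                progressByRegion.put(encodedRegionName, State.DONE);
              }

              boolean isSnapshotComplete() {
                return progressByRegion.values().stream().allMatch(s -> s == State.DONE);
              }
            }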


            People

            • Assignee:
              Sergey Shelukhin
            • Reporter:
              Mubarak Seyed
            • Votes:
              0
            • Watchers:
              44
