HBase
  1. HBase
  2. HBASE-4991

Provide capability to delete named region

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      See discussion titled 'Able to control routing to Solr shards or not' on lily-discuss
      User may want to quickly dispose of out-of-date records by deleting specific regions.

      1. HBASE-4991.trunk.v1.patch
        100 kB
        Mubarak Seyed
      2. HBASE-4991.trunk.v2.patch
        100 kB
        Ted Yu

        Issue Links

          Activity

          Mubarak Seyed added a comment -

          Please close this one. Thanks.

          Lars Hofhansl added a comment -

          Pushing to 0.96, but maybe we should just close this.

          stack added a comment -

          @Mubarak Should we then close this issue as subsumed by the new issue (we can reuse your code pasted here over in the new issue after we work through design). Good stuff.

          Mubarak Seyed added a comment -

          HBASE-5504 will be used for new design and brainstorming. Thanks.

          Mubarak Seyed added a comment -

          @Stack

          It might be time to kill this issue and start up a new one. Not under 5487 though. What you think Mubarak? I'd think that if we started a new issue, it'd be called online merge and would first work out the design.

          Fine, let us start a new issue and keep brainstorming on the design (until we get to a logical point).

          stack added a comment -

          I did already at '01/Mar/12 17:17'

          Ted Yu added a comment -

          Revisiting the flow outlined @ 01/Mar/12 06:34, Mubarak was saying '10 to 18 goes to RS side'
          Can this be clarified?

          It would be nice to mark each step with Master or RS (for region server)

          I agree distributed log splitting is a different type of process.

          stack added a comment -

          When we change the location of rename() call from master to region server for distributed log splitting, the duration was shortened from 22 minutes to 7 minutes for the same dataset.

          Because rename was done via multiple clients rather than in parallel on master? You sure it wasn't because of something else? (Distributed splitting is a different type of process to what is going on here)

          What do you want to farm out to the regionservers? We are already farming out work in the design above. We ask the regionservers to close regions for us. You want to farm out more than this? Control? To what end other than complicating the design?

          I wonder if you have statistics showing that master-side operation (for moving/deleting data of old regions) makes no difference in performance w.r.t. distributed operations.

          Well, stands to reason I'd think. I'd put it on you to come up w/ proof that what seems reasonable actually isn't, at least when talking about the tens of regions at most which is what I think this issue is about.

          Todd Lipcon added a comment -

          How is this a performance-critical path? Most of the use cases are around things like aging off old data, which isn't perf-critical at all. We should aim for simplicity rather than performance when things aren't in a hot path or in a time-to-recovery path.

          Ted Yu added a comment -

          When we change the location of rename() call from master to region server for distributed log splitting, the duration was shortened from 22 minutes to 7 minutes for the same dataset.
          I wonder if you have statistics showing that master-side operation (for moving/deleting data of old regions) makes no difference in performance w.r.t. distributed operations.

          stack added a comment -

          Ted: This has been raised already above as a concern, and some suggestions have been made that we parallelize certain ops. Your suggestion that we farm out the closing of regions to the regionservers themselves will make no difference as regards how fast regions close, and ditto as regards deletes.

          Ted Yu added a comment -

          We should consider extensibility of our design.

          Master is just asking the regionservers to close regions

          Since we're going to permit the deletion of multiple regions, the above operation may take some time, especially when the cluster is under load.

          Master can create regions and do the moving of data from old regions into new

          One optimization we're doing w.r.t. distributed log splitting is to move certain step from master to region server. The above may not scale when multiple regions are requested to be deleted.

          stack added a comment -

          @Ted

          Do you think we should continue discussion on the new framework under HBASE-5487 ?

          It might be time to kill this issue and start up a new one. Not under 5487 though. What you think Mubarak? I'd think that if we started a new issue, it'd be called online merge and would first work out the design.

          @Mubarak

          10 to 18 goes to RS side....We need ZK trackers in both sides, isn't?

          I don't think so. Master is just asking the regionservers to close regions. Master can create regions and do the moving of data from old regions into new. Its just fs operations. No need of regionserver context, especially not live regionserver context.

          Ted Yu added a comment -

          @Stack, @Mubarak:
          Do you think we should continue discussion on the new framework under HBASE-5487 ?
          This JIRA has become a novel - it is hard for other people to read.

          @Jieshan:
          Please publish your tool under a separate JIRA.

          Jieshan Bean added a comment - edited

          Yes that'll work. How you do it? You have a patch?

          @Stack:
          I have written a client tool to do this for 0.90. It only deletes the specified regions. I will modify it and submit a patch if this tool is necessary.

          Mubarak Seyed added a comment -

          Does the above presume all regions for a range are on a single regionserver (If not, how is the meta editing done – in particular the bridging of the hole in .META.?).

          No, the design was intended for a single region

          Mubarak Seyed added a comment -

          @Stack
          In your proposal, 2 to 4 are synchronous (client gets blocked) in master side, 5 returns the task id (Or error) to client, 6 to 9 are again in master side, 10 to 18 goes to RS side, and 19 to 20 in master side

          We need ZK trackers on both sides, don't we?

          Mubarak Seyed added a comment -

          Thanks Stack for your suggestions. Will think a bit.

          How many regions are we talking of merging/deleting at any one time? I think the above should work for a big table as long as we did stuff in parallel; closes and file moving. To be confirmed.

          I have tested one region at a time, but if I change the code for multiple regions then it will work in parallel, as each RS will do the processing as long as the regions-to-be-deleted are part of its online regions list.

          stack added a comment -

          @Mubarak Our comments crossed. See further up in this issue for more on what I'm thinking; it should answer the questions you pose.

          stack added a comment -

          One more thing, it would be sweet if the above were not hardcoded but instead was a set of steps described elsewhere and malleable or even better, if we could describe the steps to run on top of some generic operations framework as per FATE, but that would be a bunch more work.

          How many regions are we talking of merging/deleting at any one time? I think the above should work for a big table as long as we did stuff in parallel; closes and file moving. To be confirmed.

          stack added a comment -

          Oh, I forgot to mention that each step in the above should be repickupable – i.e. if the process running the above crashes, on restart it should continue where the previous left off – up until .META. edits (even here, we should make it so we can repair). We should include a cancel facility. Anything we develop would have to be testable; both the individual steps and then the process as a whole.
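          The crash-recovery behaviour described here (each step repickupable, with a cancel facility) could be sketched as a journalled step runner. This is a hypothetical illustration, not code from any patch; the class and method names are made up:

```java
import java.util.List;

// Hypothetical sketch: each step is idempotent and journalled, so a restarted
// process skips steps already marked done and continues where the last run left off.
public class ResumableProcedure {
    private final List<String> steps;
    private int lastCompleted;          // index of last finished step; -1 = none yet
    private volatile boolean cancelled; // cancel facility

    public ResumableProcedure(List<String> steps, int lastCompletedFromJournal) {
        this.steps = steps;
        this.lastCompleted = lastCompletedFromJournal;
    }

    public void cancel() { cancelled = true; }

    /** Runs the remaining steps; returns the index of the last completed step. */
    public int run() {
        for (int i = lastCompleted + 1; i < steps.size(); i++) {
            if (cancelled) break;
            execute(steps.get(i));  // must be idempotent / repickupable
            lastCompleted = i;      // a real impl would journal this (e.g. to zk)
        }
        return lastCompleted;
    }

    protected void execute(String step) {
        // a real implementation would close regions, move files, edit .META., etc.
    }
}
```

A restarted process would reload `lastCompletedFromJournal` from its persisted record and resume from the next step; testing the individual steps and the whole run separately, as suggested, falls out naturally.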

          Mubarak Seyed added a comment -

          I was suggesting doing it as a client script because then it'd be outside of the servers and easier to test. If client dies, restart it, it looks in zk for work to do and carries on from where the last client was. But no biggie.

          Is it something like a jruby script (like region_mover.rb) in the HBASE_HOME/bin dir? What happens if the balancer is running or a split is in progress for that region? I guess the script should check to see if there is a <zookeeper.znode.parent>/unassigned/<region-to-be-deleted> znode, and the RS should ignore a split request if the region is in the middle of deletion (by looking at <zookeeper.znode.parent>/delete-region/<region-to-be-deleted>).
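          The two guesses here, a script-side check and an RS-side check against zk, could be modeled like so. This is purely illustrative: a Set stands in for the ZooKeeper namespace, and the znode layouts are the hypothetical ones from the comment, not an agreed design:

```java
import java.util.Set;

// Toy model of the proposed zk checks; a Set of znode paths stands in for ZooKeeper.
public class DeleteRegionGuards {
    private final Set<String> znodes;
    private final String parent; // stands in for <zookeeper.znode.parent>

    public DeleteRegionGuards(Set<String> znodes, String parent) {
        this.znodes = znodes;
        this.parent = parent;
    }

    /** Script-side check: is the region still in transition (unassigned)? */
    public boolean regionUnassigned(String encodedRegion) {
        return znodes.contains(parent + "/unassigned/" + encodedRegion);
    }

    /** RS-side check: should a split be ignored because a delete is in flight? */
    public boolean deleteInProgress(String encodedRegion) {
        return znodes.contains(parent + "/delete-region/" + encodedRegion);
    }
}
```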

          What about my question about why we delegate merge/delete out to the regionservers? Why not have them do nothing but the close and then have the master do the remove or merging of fs content and fixup in meta? Would that be less moving parts?

          The master delegates tasks to the RS (e.g. openRegion for assignment, closeRegion for move); are you asking for something like a handler on the master side to do the remove or merging of fs content and the fixup in meta? We can, as long as the required methods are public in HRegionInterface (how about HRegion, as we can't serialize it?).

          stack added a comment -

          @Mubarak I see that this patch is modeled on HBASE-4213, the online schema-edit patch. I'm not sure that is a good model to follow in the first place – it's disabled because it does not currently work in the face of splits, though it has handler code supposedly to manage this, and secondly, it's a bunch of custom code specific to the schema change only. Your patch does a bunch of copy/paste from the schema patch, duplicating the model and then also repeating code except for some changes in method names and the znodes we wait on up in zk. Rather, don't you think we should be generalizing the common facility and having these two features share its use rather than making a copy, especially since we now have two clients in need (it's actually three if you count merge, which IMO this feature should be built on)? For example, in both cases we need to disable table splitting. In the schema patch it does this with a waitForInflightSchemaChange check that looks at state in zk, and then in the splitRegion code we wait by invoking the below:

          waitForSchemaChange(Bytes.toString(regionInfo.getTableName()));
          

          You come along and do a repeat. You add to the splitRegion code:

          +   waitForDeleteRegion(regionInfo.getEncodedName());
          

          The list of things to check before we go ahead and split could get pretty long if we keep on down this route.

          Instead we should have a generic disable splitting function that both schema edit and this patch could use.

          Going back to your design, I see this:

          4. DeleteRegionTracker (new class in RS side) will process nodeChildrenChanged(), get the list of regions_to_be_deleted, check that those regions are being hosted by the RS, if yes then
          
          doDeleteRegion
          call deleteRegion() in HRegionServer
          disable the region split
          close the region
          remove from META
          bridge the hole in META (extending the span of the before or after region)
          remove region directory from HDFS
          update state in ZK (<zookeeper.znode.parent>/delete-region/<encoded-region-name>)
          

          Does the above presume all regions for a range are on a single regionserver (If not, how is the meta editing done – in particular the bridging of the hole in .META.?).

          I'm asking because I think its not a good design asking regionservers to do the merge; it makes this patch more complicated than it need be IMO.

          I suggest we go back to the design and work forward from there. Your patch is fat and has a bunch of good stuff that we can repurpose once we have the design done.

          I suggest a design below. It has some prerequisites, some general function that this feature could use (and others). The prereqs if you think them good, could be done outside of this JIRA.

          Here's a suggested rough outline of how I think this feature should run. The feature I'm describing below is merge and deleteRegion for I see them as in essence the same thing.

          1. Client calls merge or deleteRegion API. API is a range of rows.
          2. Master gets call.
          3. Master obtains a write lock on table so it can't be disabled from under us. The write lock will also disable splitting. This is one of the prereqs I think. Its HBASE-5494 (Or we could just do something simpler where we have a flag up in zk that splitRegion checks but thats less useful I think; OR we do the dynamic configs issue and set splits to off via a config. change). There'd be a timer for how long we wait on the table lock.
          4. If we get the lock, write intent to merge a range up into zk. It also hoists into zk if its a pure merge or a merge that drops the region data (a deleteRegion call)
          5. Return to the client either our failed attempt at locking the table or an id of some sort used identifying this running operation; can use it querying status.
          6. Turn off balancer. TODO/prereq: Do it in a way that is persisted. The balancer switch is currently in memory only, so if the master crashes, the new master will come up in balancing mode. (If we had dynamic config we could hoist up to zk a config that disables the balancer rather than have a balancer-specific flag/znode, OR if a write lock is outstanding on a table, then the balancer does not balance regions in the locked table – this latter might be the easiest to do)
          7. Write into zk that just turned off the balancer (If it was on)
          8. Get regions that are involved in the span
          9. Hoist the list up into zk.
          10. Create region to span the range.
          11. Write that we did this up into zk.
          12. Close regions in parallel. Confirm close in parallel.
          13. Write up into zk regions closed (This might not be necessary since can ask if region is open).
          14. If a merge and not a delete region, move files under new region. Might multithread this (moves should go pretty fast). If a deleteregion, we skip this step.
          15. On completion mark zk (though may not be necessary since its easy to look in fs to see state of move).
          16. Edit .META.
          17. Confirm edits went in.
          18. Move old regions to hbase trash folder TODO: There is no trash folder under /hbase currently. We should add one.
          19. Enable balancer (if it was off)
          20. Unlock table

          Done

          Above is a suggestion. It'd get us merge and your deleteRegion.
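          The key structural point of the outline, that merge and deleteRegion share one plan and a delete simply skips the file-move step (14), can be condensed as follows. This is a hypothetical sketch of the proposed flow, with made-up step names, not an implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical condensation of the 20-step outline above: one master-driven
// plan serves both merge and deleteRegion; a pure delete omits MOVE_FILES.
public class MergePlan {
    enum Step {
        LOCK_TABLE, WRITE_INTENT_TO_ZK, DISABLE_BALANCER, LIST_SPAN_REGIONS,
        CREATE_SPANNING_REGION, CLOSE_REGIONS_IN_PARALLEL, MOVE_FILES,
        EDIT_META, ARCHIVE_OLD_REGIONS, ENABLE_BALANCER, UNLOCK_TABLE
    }

    /** Returns the ordered steps; MOVE_FILES is skipped for a pure delete. */
    public static List<Step> steps(boolean deleteRegion) {
        List<Step> plan = new ArrayList<>();
        for (Step s : Step.values()) {
            if (deleteRegion && s == Step.MOVE_FILES) continue;
            plan.add(s);
        }
        return plan;
    }
}
```

Driving such a plan through a journalled runner would also give the repickupable/cancelable behaviour asked for earlier in the thread.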

          stack added a comment -

          Well, client's deleteRegion call is asynchronous so no fail-over if client has to do the business.

          Fair enough. I was suggesting doing it as a client script because then it'd be outside of the servers and easier to test. If client dies, restart it, it looks in zk for work to do and carries on from where the last client was. But no biggie.

          What about my question about why we delegate merge/delete out to the regionservers? Why not have them do nothing but the close and then have the master do the remove or merging of fs content and fixup in meta? Would that be less moving parts?

          Let me give some higher level feedback in a sec.

          @Jieshan Yes that'll work. How you do it? You have a patch?

          Jieshan Bean added a comment -

We also have this requirement: delete some specified regions, specifically their data in HDFS, and merge those regions. We do it after disabling the table, so it seems very simple:
1. Ensure the table has been disabled.
2. Scan .META. and find all the regions that should be deleted.
3. Delete their information from .META.
4. Delete each region's directory in HDFS.
5. Add a new empty region to avoid a region hole in .META.
6. Enable the table.
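
A rough sketch of steps 2 and 5 in plain Java (not HBase API – regions are modeled as string key pairs, and all names here are illustrative): given the table's regions, pick the ones fully covered by the delete range, then compute the single empty region that plugs the resulting hole in .META.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only: a region is a [startKey, endKey) pair of Strings.
public class MetaHole {

    /** Step 2: the regions whose [start, end) lies fully inside [delStart, delEnd). */
    static List<String[]> regionsToDelete(List<String[]> regions, String delStart, String delEnd) {
        List<String[]> out = new ArrayList<>();
        for (String[] r : regions) {
            if (r[0].compareTo(delStart) >= 0 && r[1].compareTo(delEnd) <= 0) {
                out.add(r);
            }
        }
        return out;
    }

    /** Step 5: the empty filler region spans the whole deleted range. */
    static String[] fillerRegion(List<String[]> deleted) {
        return new String[] { deleted.get(0)[0], deleted.get(deleted.size() - 1)[1] };
    }

    public static void main(String[] args) {
        List<String[]> regions = Arrays.asList(
            new String[] {"a", "f"}, new String[] {"f", "m"},
            new String[] {"m", "t"}, new String[] {"t", "z"});
        List<String[]> doomed = regionsToDelete(regions, "f", "t"); // [f,m) and [m,t)
        String[] filler = fillerRegion(doomed);
        System.out.println(doomed.size() + " " + filler[0] + " " + filler[1]); // 2 f t
    }
}
```

The filler keeps the table's key space contiguous, which is exactly what step 5 is for.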

          Mubarak Seyed added a comment -

Why do we have the regionserver do anything but the close of the region? Why do we delegate the deletion to it? Why not have it done by the master? Or a client script? Have it remove the region from .META. and from the fs? And bridge the hole in .META.? Isn't that less complicated?

          Well, client's deleteRegion call is asynchronous so no fail-over if client has to do the business.
Regarding the master, it acts as a coordinator between the client and the RS, meaning it is like the move() region task (but split() goes from client to RS). The master does the cleanup job of deleting failed delete-region znodes if they exceed the configured timeout value (30 minutes):

              this.deleteRegionTracker = new MasterDeleteRegionTracker(getZooKeeper(),
                  this,this, conf.getInt("hbase.delete.region.timeout", 1800000));
          

          If we have to make client call to RS (as like compact or split) for deleteRegion then how do we do clean-up? How about master-failover?

          stack added a comment -

          Looking at the patch again, yeah, we should expose deleteRegion apis, but it should all run on a merge engine, not this specialization on merge, a 'delete region engine'

          Looking at the design:

+ Why do we have the regionserver do anything but the close of the region? Why do we delegate the deletion to it? Why not have it done by the master? Or a client script? Have it remove the region from .META. and from the fs? And bridge the hole in .META.? Isn't that less complicated?

          Mubarak Seyed added a comment -

          Implementation-wise, this is a merge with the added option that we not copy the data of the regions we are merging.

          Agreed but if we name the command like

          merge <table_name> <start_key> <end_key>
          

          then it means that clubbing multiple (2 or more) to a single region (provided data remains same)

          stack added a comment -

          Implementation-wise, this is a merge with the added option that we not copy the data of the regions we are merging.

          Mubarak Seyed added a comment -

          @Stack

          sounds like the command/api could also be named merge rather than deleteRegion (You are not actually deleting the data, is that right?)?

We do delete the data of the region (to be deleted). The real confusion with the terminology here is that merge makes 2 or more regions into one with the data remaining the same (though the start/end keys are different after the merge), whereas deleteRegion deletes one or more regions' data from HDFS (and makes a single region in meta with modified start/end keys).

          For example:
          We have three regions namely R1, R2, and R3

When we merge R2 with R3 -> the new region would be (R2R3)', and the data of R2 and R3 will be moved to (R2R3)'.
So in .META. it would look like:

R1 -> start/end key, location
(R2R3)' -> modified start/end key, location

When we do deleteRange of a single region (R2) -> the new region would be (R1)', meaning R1's end key becomes R2's end key; the data of R2 will be deleted, and R1's data will be carried into (R1)'.
So in .META. it would look like:

(R1)' -> modified start/end key, location
R3 -> start/end key, location

          Make sense?
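
To make the bookkeeping concrete, here is a plain-Java sketch (not HBase code; names and keys are illustrative) of how the .META. boundaries change in each case – merge keeps both regions' key ranges (and data), while deleteRegion drops R2's data and stretches R1's end key over the hole. Only boundary bookkeeping is modeled; data movement is out of scope:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only: a region is {name, startKey, endKey}.
public class RegionOps {

    /** merge: replace regions i and i+1 with one region spanning both ranges; data is kept. */
    static List<String[]> merge(List<String[]> meta, int i) {
        List<String[]> out = new ArrayList<>(meta);
        String[] a = out.get(i), b = out.get(i + 1);
        out.set(i, new String[] {"(" + a[0] + b[0] + ")'", a[1], b[2]});
        out.remove(i + 1);
        return out;
    }

    /** deleteRegion: drop region i's data; the left neighbour's end key is stretched over the hole. */
    static List<String[]> deleteRegion(List<String[]> meta, int i) {
        List<String[]> out = new ArrayList<>(meta);
        String[] left = out.get(i - 1), dead = out.get(i);
        out.set(i - 1, new String[] {"(" + left[0] + ")'", left[1], dead[2]});
        out.remove(i);
        return out;
    }

    public static void main(String[] args) {
        List<String[]> meta = Arrays.asList(
            new String[] {"R1", "a", "f"},
            new String[] {"R2", "f", "m"},
            new String[] {"R3", "m", "z"});
        // merge R2 with R3 -> R1, (R2R3)'
        System.out.println(merge(meta, 1).get(1)[0]);        // (R2R3)'
        // deleteRegion R2 -> (R1)' now ends at R2's end key
        System.out.println(deleteRegion(meta, 1).get(0)[2]); // m
    }
}
```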

          Ted Yu added a comment -

The data for the specified region is deleted. See the following in HRegionServer.java:

          +         // delete the dest region in FS
          +         deleteRegionFromFs(regionName,
          +             hRegion,
          +         RegionDeletionStatus.RegionDeletionState.DEST_REGION_DELETION_FROM_FS,
          +         RegionDeletionStatus.RegionDeletionState.DEST_REGION_DELETION_FROM_FS_FAILED);
          
          stack added a comment -

          @Lars

          This is not the same as merge, right?

          Sounds like it is.

          The region's data will be gone

          This patch seems to copy the data from the deleted region up into the new hole-plugging region. It doesn't seem to delete it.

As to your 1., 2., 3... yes, that's what this patch does, only the operations and the classes are all named DeleteRegion* when what is happening is region merging.

I think it's important to get the concept right, else it's going to confuse forever after.

          @Mubarak So, sounds like the command/api could also be named merge rather than deleteRegion (You are not actually deleting the data, is that right?)?

          Mubarak Seyed added a comment -

          So, we are agreed that conceptually, whats going on here is region merging?

Merge with two regions' data vs. merge with one region's data.

We do the merge just to cover the .META. hole, don't we? IMO deleteRange is a task/function from the cmdline/API, and we use merge as its implementation (as part of the task).

          Lars Hofhansl added a comment -

I have not been following this too closely, so pardon my ignorance... But it seems we're going off on a tangent.

This is not the same as merge, right? The region's data will be gone; we just need to plug the .META. hole (hbck could even do that for us).
          Or maybe we don't even touch .META. and just delete the DFS files.

          1. close/unassign the region
          2. remove the region's files
          3. either
          o re-open the region, i.e. keep the .META. entry for the empty region, and merge later when convenient
          or
          o remove the region's .META. entry
          4. done

          Maybe I'm just talking nonsense, but does it have to be much more complicated than this?

          stack added a comment -

Online merge requires the table to be disabled, but deleteRegion (deleteRange) does not.

We've had an ongoing conversation – before your time, so you are not expected to have known about it – on our doing an online merge. It's actually a pretty critical need. See HBASE-1621 for some history (ignore its title where it says the table should be offline – it should be online).

FYI, the current merge code is broken and unused. It works for a unit test, but I'd say it's years since anyone tried to use it to actually do anything useful.

So, are we agreed that conceptually what's going on here is region merging? If so, that helps understanding of what's going on here. We should also likely rename what this issue does to be about merging, since that's how we've been describing this operation over the years.

          Mubarak Seyed added a comment -

          @Stack

          This feature looks to be adding online merge to me. I need clarification from Mubarak that that is indeed the case. If so, this issue is mislabeled and the patch needs redoing.

Online merge requires the table to be disabled, but deleteRegion (deleteRange) does not.
We were discussing making use of HMerge.merge() (please refer to my comment @ 27/Jan/12 06:36), but it checks against half of the max region size:

          this.maxFilesize = conf.getLong(HConstants.HREGION_MAX_FILESIZE,
                    HConstants.DEFAULT_MAX_FILE_SIZE);
          ..
          if ((currentSize + nextSize) <= (maxFilesize / 2)) {
                    // We merge two adjacent regions if their total size is less than
                    // one half of the desired maximum size
          

So, in this case, can an adjacent region's size be > half of the max region size?
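
A minimal sketch of that check, based on the HMerge logic quoted above (sizes here are illustrative): two adjacent regions qualify only if their combined size fits in half the configured maximum file size, so a region larger than maxFilesize/2 can never be merged by this path – which is why HMerge.merge() does not fit the deleteRegion case.

```java
// Illustrative restatement of the quoted HMerge size predicate, not HBase code.
public class MergeCheck {

    /** Two adjacent regions merge only if their total size is <= half the max region size. */
    static boolean eligible(long currentSize, long nextSize, long maxFilesize) {
        return (currentSize + nextSize) <= (maxFilesize / 2);
    }

    public static void main(String[] args) {
        long maxFilesize = 256L << 20; // 256 MB cap, purely illustrative
        System.out.println(eligible(40L << 20, 60L << 20, maxFilesize));  // true: 100 MB <= 128 MB
        System.out.println(eligible(200L << 20, 60L << 20, maxFilesize)); // false: 260 MB > 128 MB
    }
}
```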

          Mubarak Seyed added a comment -

          We are going to remove

          public int getRegionsCount(byte[] regionName) throws IOException
          

as we will use getOnlineRegions() and process them on the client side; please refer to my comment @ 24/Feb/12 00:16.

          Will read the onlineMerge code.

          Ted Yu added a comment -

          For:

          +  public int getRegionsCount(byte[] regionName) throws IOException;
          

          regionName is used to extract table name. We should rename the method to, e.g. getRegionsCountForTable().

          w.r.t. utilizing code from Merge, I looked at Merge.mergeTwoRegions() and saw no fault handling code.
          I think if we add fault-tolerant code to Merge, it might look similar to what Mubarak has now.

          Mubarak Seyed added a comment -

          @Stack
          Thanks for the review

          It just adds a bunch of custom facility w/o genericizing at least some aspects so could be used by other features yet to come.

True, I never put much thought into generalization as I had focused only on delete-region. When those other features come along, can't we leverage the new generic framework (yet to come) and revisit this feature?

          stack added a comment -

          Are we going to enhance Merge by allowing to discard data belonging to one of the regions ?

          This feature looks to be adding online merge to me. I need clarification from Mubarak that that is indeed the case. If so, this issue is mislabeled and the patch needs redoing.

I was just suggesting that if you want to actually drop a region's data, you could pass a flag to merge and it would not bother copying over the files from the old regions. That would be an option. This patch as-is does not do that; it seems to copy old region data into new regions. Was just a suggestion.

          How should we deal with various failure scenarios in the process of merging ?

Eh... in a manner which is resilient against failures, TBD. I don't get your question, Ted. Are you asking me or the author of this patch?

          stack added a comment -

          After looking again too at the patch, it has too much custom code that is all about region delete.

It should take a range as Todd suggests earlier rather than a list of regions. This means you cannot pass a list of discontinuous regions, but that's ok I think; just do multiple invocations.

This seems to have a wrong param name and javadoc:

          +  /**
          +   * Gets the count of online regions of the table in a region server.
          +   * This method looks at the in-memory onlineRegions.
          +   * @param regionName
          +   * @return int regions count
          +   * @throws IOException
          +   */
          +  public int getRegionsCount(byte[] regionName) throws IOException;
          

          When I see MasterDeleteRegionTracker, and the equivalent for regionservers, it makes me yearn for a generic framework that these things could run on; it strikes me as too much custom code and custom handlers. This we should fix. We should come up w/ generics that can be customized to do feature specifics.

          Why are we using a janitor to do the delete of regions rather than an executor?

Why do we have this getDeleteRegionTracker?

The generic solution interface would have a method the balancer would check...

          + if (deleteRegionTracker.isDeleteRegionInProgress()) {

          Rather than do the above for every feature we add.

          Should this getDeleteRegionStatusFromDeleteRegionTracker be in the DeleteRegionTracker? And should it be something that is apart from the Master rather than in the master?

          This seems wrong: getDeleteRegionTracker in the MasterServices Interface.

Why do we add it there? Why can't it be independent of the Master? Having to have a Master makes it harder to test, I'm sure.

          DeleteRegionHandler should not be dealing w/ balancer. That seems dirty.

          This seems racy: waitForInflightSplit

          Do we do this every time? + moveStoreFilesToNewRegionDir(byFamily, fs, tableDir, newRegionInfo);

          If so, is this actually a merge and not a delete?

Do these methods need to be in HRegionInfo?

          moveDataFromAdjacentRegionToNewRegion
          createNewRegionFromAdjacentRegion

          Could be in HRegion or in a RegionUtil class? RS is already bloated.

          A bunch of these other methods ... adding new region and deleting old region ... would seem to have overlap with existing code where we add regions to meta after open and also w/ merge code?

We can't have the master package referenced in the zookeeper package; i.e. see MasterDeleteRegionTracker.

          I've already commented on other stuff in this patch.

In general the patch is well done. It just adds a bunch of custom facility w/o genericizing at least some aspects so they could be used by other features yet to come. In particular, this looks to be a specialization on merge. If so, let's go for merge altogether.

          Ted Yu added a comment -

          Are we going to enhance Merge by allowing to discard data belonging to one of the regions ?
          How should we deal with various failure scenarios in the process of merging ?

          stack added a comment -

Reviewing this patch again, could we not obtain this patch's objective with merge? Merge could take a flag which said "true/false: copy the data from the old regions into the new merge region".

          Ted Yu added a comment -

          @Todd:
          Do you agree with what Stack said above ?
          If so, I can start reviewing latest patch as well.

          stack added a comment -

@Mubarak After looking at FATE and what's involved, I think it's a bit much to expect that we build that as a prereq for this facility. At the same time, let's minimize custom code – code that is particular to the addition of this feature only. Let me do another review of your last patch w/ that in mind.

          Mubarak Seyed added a comment -

          @Stack,
Is there any update on this issue? Do I need to halt progress on this issue until the design is ironed out? Please let me know. Thanks.

          Todd Lipcon added a comment -

          BTW, I'm also OK with the idea that the API be "deleteRange()" and the initial implementation be limited to only deleting an exact region, with the extension to automatically split and handle multiple regions becoming a followup.

          Todd Lipcon added a comment -

          Hey folks, sorry for being remiss in following this closely. Been a busy week! My thinking is that, while a FATE-like general framework would be nice, it's not a prerequisite for this feature.

          However, having this feature properly handle failure cases is a prerequisite. I was thinking that, rather than handling it one-off for this feature, it might only be incrementally more work to do something like FATE and validate the new framework by developing this feature on top of it. It's a longer path to the end goal, but will result in something much more reusable for other similar features in the future (as well as some previous stuff being simplified).

          Mubarak Seyed added a comment -

I don't see how the plan of '01/Feb/12 07:43' lays the foundation for a generic framework. Am I missing something? It seems like its code is for this feature only?

My initial skeleton was limited to the 4991 implementation only; a framework was not part of the intent of 4991. Todd's comment in the code review @ '16/Feb/2012 7:42' (https://reviews.apache.org/r/3917/) brought up the idea of Accumulo's FATE, and it turned into a generic-framework discussion.

          Lars Hofhansl added a comment -

          Maybe we should separate this feature from a generic framework?

For this issue we could just have one API: deleteRange(table, startKey, endKey). Initially it could validate that the startKey and endKey coincide with exactly one region; that way we can extend this later without having regions exposed in the API.
          (still need to avoid races with splitting and balancing of course - makes it almost nicer to go back to the original approach of passing a region name).

          Just my $0.02.
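
A sketch of the validation Lars describes, in plain Java (not a real HBase API; the method name and region model are illustrative): the call would be rejected unless [startKey, endKey) exactly matches one existing region's boundaries, leaving room to later accept ranges spanning several regions.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only: a region is a [startKey, endKey) pair of Strings.
public class DeleteRangeCheck {

    /** True iff [startKey, endKey) exactly matches one existing region's boundaries. */
    static boolean coincidesWithOneRegion(List<String[]> regions, String startKey, String endKey) {
        for (String[] r : regions) {
            if (r[0].equals(startKey) && r[1].equals(endKey)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String[]> regions = Arrays.asList(
            new String[] {"a", "f"}, new String[] {"f", "m"}, new String[] {"m", "z"});
        System.out.println(coincidesWithOneRegion(regions, "f", "m")); // true: exactly one region
        System.out.println(coincidesWithOneRegion(regions, "f", "z")); // false: spans two regions
    }
}
```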

          stack added a comment -

          I feel some of the recent proposals / requirements are far more complex than the one

          Yeah. It seemed basic back in December.

          There wasn't such requirement when Mubarak outlined his plan

          Pardon me. I should have noticed the plan but did not. Other priorities. If I'd seen the plan I'd have blanched I think.

          Of course, having generic framework for all the master-coordinated tasks allows future additions to be concise.

          Yep. We'd have tested, proven primitives to build stuff on rather than have to do it per feature

          But I think that should have been outlined clearly in the early stage of development of a feature.

          See above. Pardon me for missing how involved this addition became.

          I don't see how the plan of '01/Feb/12 07:43' lays the foundation for a generic framework. Am I missing something? It seems like it's code for this feature only?

          Ted Yu added a comment -

          I feel some of the recent proposals / requirements are far more complex than the one @ 09/Dec/11 16:48
          There wasn't such requirement when Mubarak outlined his plan @ 01/Feb/12 07:43, either.

          Of course, having a generic framework for all the master-coordinated tasks allows future additions to be concise. But I think that should have been outlined clearly in the early stages of development of a feature.

          Considering that few of us know how Accumulo does similar tasks, we should plan / brainstorm / design carefully so that the generic framework comes up to our expectations. This would take a month, at least.
          e.g. once an action fails, we need to decide whether we should retry or roll back. For a retry, we need a way to specify a timeout or the number of retries. For a rollback, another action needs to be associated with the original action.
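          The retry-versus-rollback decision above could be modeled with a FATE-style step interface. A minimal sketch, assuming a hypothetical `Step` interface and runner (none of this is HBase or Accumulo API): each step is retried up to a limit, and on final failure the already-completed steps are undone in reverse order.

          ```java
          import java.util.ArrayDeque;
          import java.util.Deque;
          import java.util.List;

          public class StepRunner {
              // Hypothetical unit of a master-coordinated task: each step can be
              // executed and, if a later step fails, undone (rolled back).
              interface Step {
                  void execute() throws Exception;
                  void undo();
              }

              // Runs steps in order, retrying each up to maxRetries extra times;
              // on final failure, unwinds the stack of completed steps in reverse.
              static boolean run(List<Step> steps, int maxRetries) {
                  Deque<Step> done = new ArrayDeque<>();
                  for (Step step : steps) {
                      boolean ok = false;
                      for (int attempt = 0; attempt <= maxRetries && !ok; attempt++) {
                          try {
                              step.execute();
                              ok = true;
                          } catch (Exception e) {
                              // swallow and retry until attempts are exhausted
                          }
                      }
                      if (!ok) {
                          while (!done.isEmpty()) done.pop().undo(); // rollback
                          return false;
                      }
                      done.push(step);
                  }
                  return true;
              }
          }
          ```

          A real framework would also persist the step list (the intent) and progress so a restarted master can resume or unwind, which is what the znode-backed design below is getting at.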

          Considering that the refactoring Mubarak has agreed to do is on the common path of applying the generic framework, I think there is value in following the plan specified @ 01/Feb/12 07:43.

          Once the generic framework is developed, we can revisit this feature (and instant schema change). I would anticipate some refinement of the framework in the process of integration.

          @Stack, @Todd: what do you think ?

          Mubarak Seyed added a comment -

          How does Accumulo do it do you know? You might get some ideas over there.

          Will take a look. Todd's presentation highlights a comparison of HBase vs. Accumulo - http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012

          Source:
          https://svn.apache.org/repos/asf/incubator/accumulo/branches/1.4/src/server/src/main/java/org/apache/accumulo/server/fate/
          (Master-coordinated tasks use FATE; refer to TStore.java)
          https://svn.apache.org/repos/asf/incubator/accumulo/branches/1.4/src/server/src/main/java/org/apache/accumulo/server/master/Master.java

          Notes from TStore.java

          /**
           * Transaction Store: a place to save transactions
           * 
           * A transaction consists of a number of operations. To use, first create a transaction id, and then seed the
           * transaction with an initial operation. An executor service can then execute the transaction's operation,
           * possibly pushing more operations onto the transaction as each step successfully completes.
           * If a step fails, the stack can be unwound, undoing each operation.
           */
          

          For example, the delete-range operation in the master uses FATE to seed a transaction with a DELETE_RANGE table operation and submit a task; an executor service can then execute the op.

          public void executeTableOperation(TInfo tinfo, AuthInfo c, long opid,
              org.apache.accumulo.core.master.thrift.TableOperation op,
              List<ByteBuffer> arguments, Map<String,String> options, boolean autoCleanup) {
            // ... (excerpt: inside a switch over op)
            case DELETE_RANGE: {
              String tableName = ByteBufferUtil.toString(arguments.get(0));
              Text startRow = ByteBufferUtil.toText(arguments.get(1));
              Text endRow = ByteBufferUtil.toText(arguments.get(2));

              final String tableId = checkTableId(tableName, TableOperation.DELETE_RANGE);
              checkNotMetadataTable(tableName, TableOperation.DELETE_RANGE);
              verify(c, tableId, TableOperation.DELETE_RANGE,
                  check(c, SystemPermission.SYSTEM) || check(c, tableId, TablePermission.WRITE));

              fate.seedTransaction(opid,
                  new TraceRepo<Master>(new TableRangeOp(MergeInfo.Operation.DELETE, tableId, startRow, endRow)),
                  autoCleanup);
              break;
            }
          }
          
          stack added a comment -

          Do we need to design the intent part (and steps of an operation) as a generic framework for all the master-coordinated tasks?

          I'd think you could make it work for you but make it so it could be used by others. How does Accumulo do it, do you know? You might get some ideas over there.

          I thought we are only changing the API (with multiple region support) and focussing more on refactoring with good test/stress-test in this JIRA.

          You mean to get this facility into core?

          My sense is that you could get this specialized lump into hbase to do this one facility if lots of tests but my fear is that if it does go in, it'll live forever as an awkward appendage. Seems like we have an opportunity to add some base primitives that we can then build this feature on as well as others. Pity to waste it (understood if you don't want to do the generalized system).

          Can we address intent/actions part out of scope of this JIRA?

          I'm reluctant to because of the above – we'll get a specialized lump of code that will live forever and we'll all be afraid to touch.

          Maybe others think different.

          Mubarak Seyed added a comment -

          @Stack,

          The log of intent would list the steps to be done
          Is it Okay to do the above in another JIRA ?
          As a prereq for this issue? Yes.

          Please correct me if I am wrong: if we introduce the prereq, then we need to design (as you asked) the intent part (a bare mini language...) and the actual delete_region implementation needs to interpret (as Ted mentioned) the steps from the intent (from ZK). Do we need to design the intent part (and steps of an operation) as a generic framework for all the master-coordinated tasks?

          I thought we are only changing the API (with multiple region support) and focussing more on refactoring with good test/stress-test in this JIRA.

          Can we address intent/actions part out of scope of this JIRA?

          Thoughts?

          Thanks.

          stack added a comment -

          i have tried with getting a List<HRegion>, got into serialization issue

          Yeah. HRegion is not a Writable

          We are using the znode just to start the task and update the state only. If we keep track of intent vs. action in the same znode, considering the size of data in a znode, we should not exceed 1 MB as the ZK admin guide says

          Oh, you are talking of writing actual data into zk? I was just talking of intent, a bare mini language that outlines steps to complete an operation... something like your enums. I'd think this would be well under 1MB.
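          As a rough illustration of how small such an intent log would be, here is a sketch that serializes a list of operation steps (the enum names are made up for illustration, not taken from any HBase code) and checks the result against the default jute.maxbuffer limit of just under 1 MB:

          ```java
          import java.nio.charset.StandardCharsets;
          import java.util.List;

          public class IntentLog {
              // Hypothetical steps of a delete-region operation; names are illustrative.
              enum Step { CLOSE_REGION, MOVE_FILES, UPDATE_META, DELETE_ZNODE }

              // Default jute.maxbuffer: 0xfffff, just under 1 MB, per the ZK admin guide.
              static final int ZNODE_LIMIT = 0xfffff;

              // Serializes the intent as one step name per line; an intent log like
              // this is a few dozen bytes, far below the znode size limit.
              static byte[] serialize(List<Step> steps) {
                  StringBuilder sb = new StringBuilder();
                  for (Step s : steps) sb.append(s.name()).append('\n');
                  byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);
                  if (data.length > ZNODE_LIMIT) {
                      throw new IllegalStateException("intent log exceeds znode limit");
                  }
                  return data;
              }
          }
          ```

          The point is that an intent log records only the step names and their progress, not region data, so znode size is not a concern.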

          Good stuff. Might make sense to work on a bit of a design doc first?

          Mubarak Seyed added a comment -

          could use: 'List<HRegionInfo> getOnlineRegions() ' HRIs have tablename. Could figure it client-side

          Makes sense, will do it client-side; I tried getting a List<HRegion> and got into a serialization issue

          Will we have a new dir in zk per cluster region operation we want to do? Can we not exploit primitives added by hbase-4213? Or do we need to refactor hbase-4213 to get you primitives you need to do this facility? Or is there nothing in common w/ what hbase-4213 does (there is at least the closing of a region?)

          We can certainly refactor the HBASE-4213 MasterSchemaChangeTracker and SchemaChangeTracker; we can club them into a single tracker on each side (Master and RS). Will dig more.

          online merge should have a bunch of overlap w/ this feature? Would be great if they could share a bunch of code/primitives.

          Sure, more refactoring

          As has been suggested, rather than a /delete-region, instead we'd have a log of intent+log of actions thing up in zk I suppose. The log of intent would list the steps to be done and then the log of actions thingy would log how far the operation had gone

          We are using the znode just to start the task and update the state only. If we keep track of intent vs. action in the same znode, considering the size of data in a znode, we should not exceed 1 MB as the ZK admin guide says

          jute.maxbuffer

          (Java system property: jute.maxbuffer)
          This option can only be set as a Java system property. There is no zookeeper prefix on it. It specifies the maximum size of the data that can be stored in a znode. The default is 0xfffff, or just under 1M. If this option is changed, the system property must be set on all servers and clients otherwise problems will arise. This is really a sanity check. ZooKeeper is designed to store data on the order of kilobytes in size.

          Can we address it in another JIRA, as we need to persist the znode with a Map<Intent, Action>?

          That's a bit broken in my opinion. It's wonky having zk have references out to other main packages. Not your fault. Should have caught that in review of hbase-4213.

          Well, we can still refactor (club schema change trackers and delete-region trackers into one) and place them into appropriate package?

          stack added a comment -

          Is it Okay to do the above in another JIRA ?

          As a prereq for this issue? Yes.

          Ted Yu added a comment -

          The log of intent would list the steps to be done

          This would introduce some kind of interpreter.
          Is it Okay to do the above in another JIRA ?

          Mubarak Seyed added a comment -

          So, you can only do one region at a time? Why would it be hard doing multiple given you are tracking? Or is it that it makes the tracking yet more complicated?

          As long as start/end keys are in region(s) boundary, we can do multiple regions, each region's encoded name will be added under <zookeeper.znode.parent>/delete-region as one znode.

          For example, we have regions R1, R2, R3 and R4; what I meant was: if startKey is the startKey of R1 and the range spans across R2 and R3, but endKey is some value other than the endKey of R4, then we will error out the entire task. I believe the 4th point in my previous comment at 23/Feb/12 02:43 needs to be removed as the 3rd point covers it well. Sorry for the confusion.
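          The boundary check described above can be sketched in plain Java, assuming the regions are sorted by start key; `Region` here is a simplified stand-in for HRegionInfo, not actual HBase code:

          ```java
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          public class RegionSpanCheck {
              // Simplified stand-in for HRegionInfo: just the region's key boundaries.
              static final class Region {
                  final byte[] startKey;
                  final byte[] endKey;
                  Region(byte[] startKey, byte[] endKey) {
                      this.startKey = startKey;
                      this.endKey = endKey;
                  }
              }

              // Returns the contiguous run of regions exactly covered by
              // [startKey, endKey], or null if either key does not fall on a
              // region boundary (the "error out the entire task" case above).
              static List<Region> coveredRegions(List<Region> sorted,
                                                 byte[] startKey, byte[] endKey) {
                  List<Region> run = new ArrayList<>();
                  boolean started = false;
                  for (Region r : sorted) {
                      if (!started) {
                          if (Arrays.equals(r.startKey, startKey)) started = true;
                          else continue;
                      }
                      run.add(r);
                      if (Arrays.equals(r.endKey, endKey)) return run;
                  }
                  return null; // startKey or endKey did not match a region boundary
              }
          }
          ```

          With R1..R4, a request from R1's startKey to R3's endKey resolves to [R1, R2, R3]; a request ending mid-R4 resolves to null and would be rejected.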

          stack added a comment -

          Do you need tablename, startkey, endkey? Can't you just pass region name?

          Again, you may be trying to not reveal region type in API but then delete_region would be different to close_region which takes a regionname IIRC?

          ..but client does not know the tableName for a regionName in our case...

          It does not know? I may be missing context. If I am, ignore this comment. If it did know, could use: 'List<HRegionInfo> getOnlineRegions() ' HRIs have tablename. Could figure it client-side. You can't use the List<HRegion> because we can't serialize HRegion to pass over connection... that call is if you are running in same context in JSP or in a unit test or something.

          The design choice is like HBASE-4213, meaning the master creates a znode under <zookeeper.znode.parent>/delete-region....

          Fair enough. Will we have a new dir in zk per cluster region operation we want to do? Can we not exploit primitives added by hbase-4213? Or do we need to refactor hbase-4213 to get you primitives you need to do this facility? Or is there nothing in common w/ what hbase-4213 does (there is at least the closing of a region?)

          ...If we are considering delete_region as a tool/util then we can refactor it as a tool/util like the Online/Offline merge code....

          online merge should have a bunch of overlap w/ this feature? Would be great if they could share a bunch of code/primitives. As has been suggested, rather than a /delete-region, instead we'd have a log of intent+log of actions thing up in zk I suppose. The log of intent would list the steps to be done and then the log of actions thingy would log how far the operation had gone (I should read up on the cited accumulo doo-hickey).

          We do put all our ZK trackers in zookeeper package and this is how online schema change HBASE-4213 was implemented.

          That's a bit broken in my opinion. It's wonky having zk have references out to other main packages. Not your fault. Should have caught that in review of hbase-4213.

          Good on you Mubarak.

          stack added a comment -

          Shell command needs to be changed as delete_region <table_name> <start_key> <end_key>

          delete_region would go well w/ our current close_region. Do you need tablename, startkey, endkey? Can't you just pass region name?

          ditto for the deleteRegion call (though maybe I'm missing the fact that you are trying to respect Todd's comments above that we not have region come up out of the API – ignore this remark if so).

          If start/end key for the specified table is spanned across multiple regions then it is out of scope of this JIRA (throw error).

          So, you can only do one region at a time? Why would it be hard doing multiple given you are tracking? Or is it that it makes the tracking yet more complicated?

          Mubarak Seyed added a comment -

          From the above discussions, this is what I am thinking of implementing:

          • Shell command needs to be changed as delete_region <table_name> <start_key> <end_key>
          • API will be changed to deleteRegion(tableName, startKey, endKey) in HBaseAdmin
          • If there is no match for start/end key for the specified table then just error out
          • If start/end key for the specified table is spanned across multiple regions then it is out of scope of this JIRA (throw error).

          Thanks.

          Mubarak Seyed added a comment -

          How about deleteRegion(tableName, startKey, endKey)?

          Mubarak Seyed added a comment -

          If we specify only the start and end key, what happens if multiple regions (belonging to multiple tables) hold the same start/end key?

          For example: I have 3 tables => table1, table2, table3
          If I pre-split them using the first two characters like [0-9][a-z] => we get 36 x 36 = 1296 regions per table => 1296 x 3 = 3888 total regions (across all RSs), so keys would look like

          a0.... to az.....
          ..
          z0... to zz...
          ..
          01... to 0z...

          so we end up

          table1,region1,start_key1,end_key1
          table2,region1,start_key1,end_key1
          table3,region1,start_key1,end_key1

          start_key1 and end_key1 need to be resolved at the input level unless the fully qualified path of table1,region1,start_key1 and table1,region1,end_key1 is specified.

          Mubarak Seyed added a comment -

          @Stack
          Thanks for the review

          Do we need to add this method to the region server interface?

          +  public int getRegionsCount(byte[] regionName) throws IOException;
          

          Can we not just count what comes back from the get on online regions?

          We need to get the count per table; getOnlineRegions returns a List<HRegion> for a table, but the client does not know the tableName for a regionName in our case. Alternatively, we could do two calls (one to get the HRegionInfo and take the tableName from there, and the next to get the List<HRegion>). I thought we could simplify by adding a new interface.

          public int getRegionsCount(byte[] regionName) throws IOException {
            return getOnlineRegions(getRegionInfo(regionName).getTableName()).size();
          }
          

          Do we have to run the region delete in the Master process? Can the client not do it?

          The design choice is like HBASE-4213, meaning the master creates a znode under <zookeeper.znode.parent>/delete-region, the RS trackers get notified of the children change, and then the RS which hosts the region to be deleted will process the delete-region request and update the state in the
          <zookeeper.znode.parent>/delete-region/<encoded-region-name-to-be-deleted> znode.

          Is it really necessary adding + public MasterDeleteRegionTracker getDeleteRegionTracker(); to the MasterServices? This will have a ripple effect through Tests and it seems like a bit of an exotic API to have in this basic Interface.

          Will think a bit more and update you

          Does all of this new code need to be in HRegionServer? Can it live in a class of its own?

          I'd like to hear the comments from code review; we can refactor into a helper class.

          There must be a million holes here (HRS crashes in middle of file moving or creation of the merged region, files partially moved or deleted....).

          I believe the delete-region state in ZK will help to recover from failures. We need more test cases with individual failure scenarios such as an HRS crash, failure of the merged region, failure of a file remove in HDFS, failure of new region directory creation in HDFS, partial files, etc.; I will add them when I do the stress test for Todd's suggestion.

          Does this code all need to be in core? Can we not make a few primitives and then run it all from outside in a tool or script w/ state recorded as we go so can resume if fail mid-way? There are a bunch of moving pieces here. Its all bundled up in core code so its going to be tough to test.

          If we are considering delete_region as a tool/util, then we can refactor it into a tool/util like the Online/Offline merge code.

          Adding this to onlineregions, + public void deleteRegion(String regionName) throws IOException, KeeperException;, do all removals from online regions now use this new API (Its probably good having it here... but just wondering about the places where regions currently get removed from online map, do they go a different route than this new one?)

          The new deleteRegion() API does more than just remove from the online region(s) map. In other places we use

          public boolean removeFromOnlineRegions(final String encodedName)

          which is called from openRegion(), refreshRegion(), createDaughters() in SplitTransaction, and CloseRegionHandler.

          How hard will it be to reuse parts to do say an online merge of a bunch of adjacent regions?

          Once Todd's proposal is implemented, I will find a way to do more refactoring (to cut down repeated code).

          Are the enums duplicated?

          Yes, I will take care of it in the refactoring.

          Why does zookeeper package have classes particular to master and regionserver?

          We put all our ZK trackers in the zookeeper package; this is how online schema change HBASE-4213 was implemented.

          stack added a comment -

          @Mubarak

          Do we need to add this method to the region server interface?

          +  public int getRegionsCount(byte[] regionName) throws IOException;
          

          Can we not just count what comes back from the get on online regions?

          Do we have to run the region delete in the Master process? Can the client not do it?

          Is it really necessary adding + public MasterDeleteRegionTracker getDeleteRegionTracker(); to the MasterServices? This will have a ripple effect through Tests and it seems like a bit of an exotic API to have in this basic Interface.

          I like the refactor in HRegion.

          Does all of this new code need to be in HRegionServer? Can it live in a class of its own?

          There must be a million holes here (HRS crashes in middle of file moving or creation of the merged region, files partially moved or deleted....).

          Does this code all need to be in core? Can we not make a few primitives and then run it all from outside in a tool or script w/ state recorded as we go so can resume if fail mid-way? There are a bunch of moving pieces here. Its all bundled up in core code so its going to be tough to test.

          Adding this to onlineregions, + public void deleteRegion(String regionName) throws IOException, KeeperException;, do all removals from online regions now use this new API (Its probably good having it here... but just wondering about the places where regions currently get removed from online map, do they go a different route than this new one?)

          Hmmmm... looks like a bunch of state is being tracked in zk. Thats good. Its all custom to this feature. How hard will it be to reuse parts to do say an online merge of a bunch of adjacent regions?

          Yeah, there is a lot of moving parts... a master delete tracker and a regionserver delete tracker... I've not done an extensive review of design but that seems pretty heavy going.

          Are the enums duplicated?

          Why does zookeeper package have classes particular to master and regionserver?

          Mubarak Seyed added a comment -

          Thanks Todd. Will come up with testcases and run the stress test. Thanks.

          Todd Lipcon added a comment -

          I'm cool with the API now. I'm a bit nervous about the complexity of the implementation - the fact that this (and the online schema change) have to do things like disable splits and disable balancing is a little "smelly" to me. But, I don't necessarily have a better idea.

          In terms of testing before commit, I'd like to see a kind of stress test which can be run continuously that does something like the following in a loop:

          • create a table with 10 regions
          • insert 10 rows into each region
          • pick a random order to delete the regions
          • start some threads which perform full table scans to count rows; concurrently, delete each of the regions in the above random order
          • number of rows should drop 100->90->80, etc.
          • drop table

          If we can run this test continuously, and the test can still succeed even if we periodically kill the master or one of the RSes, it seems sufficient.
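          The loop above can be sketched as a self-contained simulation: an in-memory map of region names to rows stands in for a real table, and no HBase APIs are used (everything here is illustrative, not test code from the patch).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Self-contained simulation of the proposed stress test: a table with 10
// regions of 10 rows each, regions deleted in a random order, and the total
// row count checked after every delete. An in-memory map stands in for the
// real table; no HBase APIs are used.
class DeleteRegionStressSketch {
    static List<Integer> run(long seed) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        for (int r = 0; r < 10; r++) {
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                rows.add("row-" + r + "-" + i);
            }
            table.put("region-" + r, rows);
        }

        // Pick a random order to delete the regions.
        List<String> order = new ArrayList<>(table.keySet());
        Collections.shuffle(order, new Random(seed));

        List<Integer> countsAfterEachDelete = new ArrayList<>();
        for (String region : order) {
            table.remove(region);                       // "delete region"
            int count = table.values().stream()         // "full table scan"
                .mapToInt(List::size).sum();
            countsAfterEachDelete.add(count);
        }
        return countsAfterEachDelete;                   // 90, 80, ..., 0
    }
}
```

          The real test would additionally run the scans in concurrent threads and inject master/RS kills, as described above.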

          Mubarak Seyed added a comment -

          @Todd, @Nicolas
          Can i go ahead with changes or we need more time for brainstorming? Thanks.

          Ted Yu added a comment -

          The introduction of new (range) delete marker should be considered carefully, as Lars mentioned.
          It has been 2 weeks since Mubarak outlined skeleton for current solution. It would be nice if questions about the approach were raised earlier.
          So far, feedback from Stack, Todd and Lars seems to align with the current solution, with changes to the specification of start and end keys.

          I think Mubarak can go ahead and make corresponding changes.

          Lars Hofhansl added a comment -

          I had thought of something similar in HBASE-5268, but it turns out that is not so simple. Maybe a range delete marker is easier to implement since the beginning is a known point; still, how would a scan know which delete markers are in effect?
          We'd have to store all range delete markers at the beginning of the file, and then seek there on every scan. There is then the question of when it is safe to collect the range marker, etc.

          I agree with Todd on the API as well. At the same time I agree with Ted & co. Being able to delete an entire region is powerful (especially with pluggable RegionSplitPolicies - HBASE-5304).
          So generally I'd prefer an implementation that either enforces the start-range and end-range to coincide with a region boundary (although that is awkward), or has logic that splits the two adjacent regions and deletes the whole range by removing a range of regions (including the newly created partial regions).

          Nicolas Spiegelberg added a comment -

          I agree with Todd's comments. This problem doesn't seem to be solved at the correct layer. A more natural fit would be to add a range delete marker and optimize the scanner code on compaction. Adding a range delete seems more natural than an explicit region delete & can be used at the client level.

          Ted Yu added a comment -

          Mubarak did borrow some code from merge().
          We'll continue to refactor the code in this patch to reduce overlap.

          If you look at the flow in current patch, there're multiple steps involved. We need to prepare for failure of every step.

          Todd Lipcon added a comment -

          (btw, I am OK with the API that takes a key range and requires that the key range map to a region initially).

          But the above comments about implementation stand - can this be done in such a way that it shares code with "merge"? I.e., that it's a two-step process: step 1 is to clear the memstore + delete hfiles in a region (so the region still exists and is empty), and step 2 is that we initiate a merge to kill the now-empty one? The first step would then be RS-local, which makes it simpler, and the second step would cover other use cases as well.

          Todd Lipcon added a comment -

          Hmm, maybe I am a bit slow: in both cases the data of the named region wouldn't exist any more on the original region server.

          But region servers are also not part of the data model.
          There's the implementation (region servers, regions, etc) and the data model (rows, columns, etc). move is an administrative/operational thing (like turning on/off the load balancer or gathering metrics). deleteRegion is a data model thing (it affects what's stored). Having APIs that mix the two is messy.

          Ted Yu added a comment -

          Hmm, maybe I am a bit slow: in both cases the data of the named region wouldn't exist any more on the original region server.

          However, in order to move this feature forward, I am fine with changing the API to deleteRange(startKey, endKey).
          In this JIRA, we can check that (startKey, endKey) actually form the boundary of an existing region. We can cover other scenarios in future JIRAs.
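          The boundary check described here could look roughly like the following sketch; the region boundaries are passed in explicitly (in HBase they would come from the META table), and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Sketch of the validation described above: deleteRange(startKey, endKey)
// is accepted only when the pair exactly matches some existing region's
// start and end keys. Region boundaries are passed in explicitly here;
// in HBase they would come from the META table. Names are hypothetical.
class DeleteRangeValidationSketch {
    // regionBoundaries[i][0] is a region's start key, [i][1] its end key.
    static boolean matchesSomeRegion(byte[][][] regionBoundaries,
                                     byte[] startKey, byte[] endKey) {
        for (byte[][] region : regionBoundaries) {
            if (Arrays.equals(region[0], startKey)
                && Arrays.equals(region[1], endKey)) {
                return true;
            }
        }
        return false;
    }
}
```

          A fuller version, along the lines Todd suggests, could first split regions so that an arbitrary (startKey, endKey) aligns with region boundaries before validating.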

          Todd Lipcon added a comment -

          The difference is that move is an operational thing affecting load balancing. The data in the database stays the same. deleteRegion leaks the concept of regions into the data model itself, since it affects what's stored.

          Ted Yu added a comment -

          @Todd:
          In HMasterInterface, there is already API exposing the concept of regions:

            public void move(final byte[] encodedRegionName, final byte[] destServerName)
                throws UnknownRegionException;
          

          In this regard, the new API doesn't expose newer concept:

            public void deleteRegion(List<byte[]> encodedRegionNameList)
          

          The above is, I think, equivalent to the deleteRange() you proposed. Maybe the above is better in that we don't need to validate that startKey is the first key of some region, etc.
          I think splitting the underlying region(s) of the specified table, in case startKey is not the first key, is outside the scope of this JIRA.

          For use case, we have an internal team which requests this feature. They can manage to age the data in selected regions such that they can drop whole region(s) quickly.

          Todd Lipcon added a comment -

          Note also the comment in the Lily thread referenced:

          He he, cool. Unfortunately we will not be able to use a HBase feature where you can quickly delete named regions

          Maybe some others can explain their use cases here?

          Todd Lipcon added a comment -

          Maybe I'm misunderstanding the use case, but I'm slightly skeptical that there isn't a simpler way of solving the same thing.

          It seems like this conflates two operations: 1) delete a bunch of data within a range, and 2) merge a region back with its neighbors. #2 is a more general "merge small regions" operation, which I think should be attacked orthogonally to the issue of #1 (bulk range delete).
          My other concern is that the API here exposes the concept of regions. A better API would be deleteRange(startKey, endKey). It might be implemented underneath by splitting the table such that startKey is the first key in some region, and endKey is the last key in another, and then deleting the underlying regions. See http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Table_Configuration.html#Delete_Range for example

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12514800/HBASE-4991.trunk.v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -134 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 171 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestShell
          org.apache.hadoop.hbase.mapreduce.TestImportTsv
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/970//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/970//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/970//console

          This message is automatically generated.

          jiraposter@reviews.apache.org added a comment - - edited

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3917/
          -----------------------------------------------------------

          (Updated 2012-02-15 19:16:37.200644)

          Review request for hbase.

          Changes
          -------

          Removing HBASE-4991 from bugs field. Thanks.

          Summary
          -------

          Code review request for HBASE-4991 (Provide capability to delete named region).

          Diffs


          /src/main/java/org/apache/hadoop/hbase/catalog/MetaEditor.java 1236386
          /src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java 1236386
          /src/main/java/org/apache/hadoop/hbase/ipc/HMasterInterface.java 1236386
          /src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 1236386
          /src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1236386
          /src/main/java/org/apache/hadoop/hbase/master/MasterServices.java 1236386
          /src/main/java/org/apache/hadoop/hbase/master/handler/DeleteRegionHandler.java PRE-CREATION
          /src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 1236386
          /src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1236386
          /src/main/java/org/apache/hadoop/hbase/regionserver/OnlineRegions.java 1236386
          /src/main/java/org/apache/hadoop/hbase/zookeeper/MasterDeleteRegionTracker.java PRE-CREATION
          /src/main/java/org/apache/hadoop/hbase/zookeeper/RegionServerDeleteRegionTracker.java PRE-CREATION
          /src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher.java 1241350
          /src/main/ruby/hbase/admin.rb 1236386
          /src/main/ruby/shell.rb 1236386
          /src/main/ruby/shell/commands/delete_region.rb PRE-CREATION
          /src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 1236386
          /src/test/java/org/apache/hadoop/hbase/client/TestDeleteRegion.java PRE-CREATION
          /src/test/java/org/apache/hadoop/hbase/master/TestCatalogJanitor.java 1236386
          /src/test/java/org/apache/hadoop/hbase/util/MockRegionServerServices.java 1236386

          Diff: https://reviews.apache.org/r/3917/diff

          Testing
          -------

          Unit tests passed. Tested on a 5-node dev cluster.

          Test case: src/test/java/org/apache/hadoop/hbase/client/TestDeleteRegion.java

          Thanks,

          Mubarak

          Mubarak Seyed added a comment -

          Waiting for corporate approval to contribute this patch. Thanks.

          Mubarak Seyed added a comment -

          1. HBaseAdmin needs to have deleteRegion(byte[] encodedRegionName), which invokes getMaster().deleteRegion(encodedRegionName)
          (HMasterInterface will have deleteRegion(byte[] encodedRegionName))

          2. HMaster needs to have deleteRegion(byte[] encodedRegionName).

          • Do we need pre/post CP here?
          • Need DeleteRegionHandler, need to check whether specified region belongs to user regions (shouldn't be ROOT or META), turn off load balancer.

          3. MasterDeleteRegionTracker (new class) will create deleteRegionNode (under <zookeeper.znode.parent>/delete-region); the processing steps closely follow MasterSchemaChangeTracker

          4. DeleteRegionTracker (new class on the RS side) will process nodeChildrenChanged(), get the list of regions_to_be_deleted, and check that those regions are hosted by the RS; if yes, then

          • doDeleteRegion
            • call deleteRegion() in HRegionServer
              • disable the region split
              • close the region
              • remove from META
            • bridge the hole in META (extending the span of the preceding or following region)
              • remove region directory from HDFS
            • update state in ZK (<zookeeper.znode.parent>/delete-region/<encoded-region-name>)

          5. MasterDeleteRegionTracker gets the nodeDataChanged() ZK event, processes the new state, and cleans up the znode if the region was deleted successfully; otherwise it logs a debug message

          I think I am missing some of the failure cases:

          • update of before or after region
          • Master fail-over
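A rough sketch of the ZK coordination outlined in steps 3-5 above might look like the following. The znode path comes straight from the comment; the state names, transition rule, and the example encoded region name are purely illustrative assumptions, not part of the actual patch.

```java
// Sketch of delete-region ZK state tracking (steps 3-5 above).
// The znode path mirrors <zookeeper.znode.parent>/delete-region/<encoded-region-name>;
// the State enum and transition rule are illustrative assumptions only.
public class DeleteRegionStateSketch {

    // Builds the per-region znode the master tracker would create.
    static String deleteRegionZNode(String zkParent, String encodedRegionName) {
        return zkParent + "/delete-region/" + encodedRegionName;
    }

    // Hypothetical states a region server could report back via the znode.
    enum State { PENDING, CLOSING, META_UPDATED, HDFS_REMOVED, DONE, FAILED }

    // In this sketch only single forward steps are legal, plus a jump to
    // FAILED from any non-terminal state.
    static boolean isLegalTransition(State from, State to) {
        if (to == State.FAILED) {
            return from != State.DONE && from != State.FAILED;
        }
        return to.ordinal() == from.ordinal() + 1;
    }

    public static void main(String[] args) {
        System.out.println(deleteRegionZNode("/hbase", "1028785192"));
        System.out.println(isLegalTransition(State.CLOSING, State.META_UPDATED)); // true
        System.out.println(isLegalTransition(State.DONE, State.FAILED));          // false
    }
}
```

Such a forward-only state machine would also give the master a clean way to detect the failure cases listed above: any znode stuck in a non-terminal state after a timeout can be retried or reported.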
          Ted Yu added a comment -

          I am thinking that the new API we'll be adding should allow more than one region to be specified for deletion. Presumably the regions should be adjacent.

          This would be useful when, say, the cron job that calls this new API to clean up obsolete data got interrupted (server down time, etc) and there is a need to catch up with data cleansing.
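The adjacency requirement above can be sketched as a simple check: the regions to drop must form one contiguous key range, i.e. each region's end key must equal the next region's start key. The class and method names here are assumptions for illustration, not HBase APIs.

```java
// Sketch of the adjacency check a multi-region delete API would need.
// Each entry is {startKey, endKey}; keys are simplified to Strings.
import java.util.List;

public class AdjacencySketch {

    // Regions are adjacent when each region's end key equals the
    // next region's start key, leaving no gap in the key range.
    static boolean areAdjacent(List<String[]> regions) {
        for (int i = 0; i + 1 < regions.size(); i++) {
            if (!regions.get(i)[1].equals(regions.get(i + 1)[0])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String[]> contiguous = List.of(
            new String[]{"a", "b"}, new String[]{"b", "c"});
        List<String[]> withGap = List.of(
            new String[]{"a", "b"}, new String[]{"c", "d"});
        System.out.println(areAdjacent(contiguous)); // true
        System.out.println(areAdjacent(withGap));    // false
    }
}
```

Rejecting non-adjacent batches up front keeps the META repair simple: only one hole has to be bridged per operation.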

          Ted Yu added a comment -

          Forgot to mention in my comment @ 27/Jan/12 18:32:
          Load balancing should be disabled at the beginning of operation and enabled at the end of operation.

          stack added a comment -

          For successive regions R1, R2 and R3, if we delete R2, we can change the end key of R1 to be the original end key of R2 and drop region R2 directly.

          Changing the end key of R1 to be the end key of R2 will require creating a new region R1'; we'll have to close R1 and R2 and delete both (after moving the R1 data to R1').
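The key arithmetic behind "extending the span" can be sketched as follows: for adjacent regions R1 [a,b) and R2 [b,c), deleting R2 produces a widened R1' [a,c). The RegionSpan type is a simplified stand-in for illustration, not HBase's actual HRegionInfo API.

```java
// Sketch of the "bridge the hole" key arithmetic described above.
// RegionSpan is a simplified stand-in; real code would edit HRegionInfo
// entries in .META.
import java.util.Arrays;

public class BridgeHoleSketch {

    static class RegionSpan {
        final byte[] startKey, endKey;
        RegionSpan(byte[] startKey, byte[] endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Drop the deleted region by extending its predecessor over its span:
    // R1 [a,b) + deleted R2 [b,c) -> R1' [a,c).
    static RegionSpan bridge(RegionSpan before, RegionSpan deleted) {
        if (!Arrays.equals(before.endKey, deleted.startKey)) {
            throw new IllegalArgumentException("regions are not adjacent");
        }
        return new RegionSpan(before.startKey, deleted.endKey);
    }

    public static void main(String[] args) {
        RegionSpan r1 = new RegionSpan("a".getBytes(), "b".getBytes());
        RegionSpan r2 = new RegionSpan("b".getBytes(), "c".getBytes());
        RegionSpan r1Prime = bridge(r1, r2);
        System.out.println(new String(r1Prime.startKey) + ".."
            + new String(r1Prime.endKey)); // a..c
    }
}
```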

          ramkrishna.s.vasudevan added a comment -

          +1 on the feature. We also need to take care of race conditions, if any.

          Ted Yu added a comment -

          As what is done in HBASE-4213, we should also disable region splitting at the beginning of the operation and re-enable splitting at the end.

          Ted Yu added a comment -

          Considering various failure cases, I think we need to store the state of region deletion progress in zookeeper (under /hbase/fast-delete, e.g.).
          This is needed because we should be prepared that update of R1 in the above case may fail.
          You can find an example in HBASE-4213 where SchemaChangeTracker tracks the state transitions.

          Ted Yu added a comment -

          Good point.
          As I outlined @ 26/Jan/12 23:38, we don't need to create empty region. Therefore we don't merge regions.
          For successive regions R1, R2 and R3, if we delete R2, we can change the end key of R1 to be the original end key of R2 and drop region R2 directly.

          Mubarak Seyed added a comment -

          I think we can make use of

          protected boolean merge(final HRegionInfo[] info) throws IOException {
            // ... (excerpted from HMerge; the size check below sits inside a
            // loop over adjacent region pairs, which is what the break exits)
            if ((currentSize + nextSize) <= (maxFilesize / 2)) {
              // We merge two adjacent regions if their total size is less than
              // one half of the desired maximum size
              LOG.info("Merging regions " + currentRegion.getRegionNameAsString() +
                  " and " + nextRegion.getRegionNameAsString());
              HRegion mergedRegion =
                  HRegion.mergeAdjacent(currentRegion, nextRegion);
              updateMeta(currentRegion.getRegionName(), nextRegion.getRegionName(),
                  mergedRegion);
              break;
            }
            // ...
          }
          

          What happens if sum(previous_region_size, next_region_size) > maxFilesize when we try to merge adjacent regions (to bridge the hole)?

          Ted Yu added a comment -

          BTW OnlineMerger is in src/main/java/org/apache/hadoop/hbase/util/HMerge.java

          I think for this case we don't need to create an empty region because we would end up closing at least two regions. That may increase the downtime for the underlying table.

          Jonathan Hsieh added a comment -

          Oops – wasn't looking at the comment tab.

          There is similar code in OnlineMerge and uber hbck.

          The code in uber hbck creates a new empty region, closes old regions, moves data into the new empty region, and then activates the new now populated region.

          Beware – I found that just closing a region seems to have left data around in the HMaster's memory, which caused disabling to have problems in the 0.90.x version. I'm in the process of porting to trunk/0.92 currently and am finding out whether there are similar or different problems. I think I saw something else in closeRegion recently that I need to try out – I don't remember which version that is, however.

          Ted Yu added a comment -

          I think both of them should be done.

          Jonathan Hsieh added a comment -

          When you are deleting regions, do you intend to just get rid of all the data in the region, or do you mean to create a hole and then merge with a preceding or succeeding region?

          Ted Yu added a comment -

          The above syntax makes sense.

          Mubarak Seyed added a comment -

          Do we need to add a command under tools in the hbase shell (with a public API for deleting a named region)?

          How about this?

          hbase(main)> delete_region <region_name>

          compact and major_compact support a region name as an argument; can we use the same approach? Thanks.

          Ted Yu added a comment -

          One of our teams is asking for this feature as well.

          stack added a comment -

          Yeah, we need this. Should be easy enough... close the region out on the regionserver, remove it from .META., bridge the hole in .META. by extending the span of the preceding or following region, then finally remove the region directory from HDFS.


            People

            • Assignee: Mubarak Seyed
            • Reporter: Ted Yu
            • Votes: 0
            • Watchers: 8

            Dates

            • Created:
            • Updated:
            • Resolved: