HBase
  HBASE-3833

Ability to support includes/excludes list in HBase

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.90.2
    • Fix Version/s: None
    • Component/s: Client, regionserver
    • Labels:
      None
    • Release Note:
      This change adds support for includes and excludes list in HBase. To use it, specify the following properties in the hbase configuration file:
      hbase.hosts - for the includes file
      hbase.hosts.exclude - for the excludes file

      Both files are optional. If the includes file is missing, any node is allowed (except the ones in the excludes file).

      If the includes file is present, then a node has to be present in the includes file and absent from the excludes file, to be able to successfully join.

      These files can be changed when the server is running. After changing these files, you can use the hbase console to issue 'refresh_nodes' command for the master to pick up the new config.

      If an online region server is added to the excludes file, 'refresh_nodes' will kick it out, with a specific exception. When the region server gets that exception, it shuts itself down.
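
      For reference, a minimal configuration sketch matching the note above; the file paths and the one-hostname-per-line file format are assumptions for illustration, not taken from the patch:

        <!-- hbase-site.xml (sketch) -->
        <property>
          <name>hbase.hosts</name>
          <value>/path/to/hbase.hosts</value>            <!-- includes file, optional -->
        </property>
        <property>
          <name>hbase.hosts.exclude</name>
          <value>/path/to/hbase.hosts.exclude</value>    <!-- excludes file, optional -->
        </property>

        # after editing either file, from the hbase shell:
        hbase> refresh_nodes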

      Description

      An HBase cluster currently has no way to restrict the master to accepting regionservers only from a specified list. Such a list helps prevent administrative errors where the same machine is accidentally included in two clusters. It also allows the administrator to easily remove un-ssh-able machines from the cluster.

      Attachments

      1. ASF.LICENSE.NOT.GRANTED--excl-patch.txt
        28 kB
        Vishal Kathuria
      2. excl-patch.txt
        18 kB
        Vishal Kathuria

        Issue Links

          Activity

          Vishal Kathuria added a comment -

          I fell sick last week and then got busy with a conference. I should be getting back to this soon.

          dhruba borthakur made changes -
          Status: Patch Available [ 10002 ] → Open [ 1 ]
          dhruba borthakur added a comment -

          Not ready yet.

          Vishal Kathuria added a comment -

          @Stack: I started on another project, so didn't get a chance to put the patch together. I have been meaning to get back to it. Will try to shoot for it this week.

          Thanks!

          stack added a comment -

          @Vishal Any luck w/ hacking up another patch?

          Todd Lipcon made changes -
          Link This issue is related to HBASE-4298 [ HBASE-4298 ]
          Vishal Kathuria added a comment -

          Stack:
          My intention was to build a graceful shutdown (that is the reason I assign the regions first and then expire the server). I assumed the assign* API unloads regions from the source region server when assigning them to the new region server, but I guess I was wrong in that assumption. Let me take a look at the code again to see how I can change it to make the shutdown graceful and avoid the situation you mentioned of two servers owning the same region - that is clearly not desirable.

          Yuzhi:
          In case any of the meta regions are involved, then the MetaServerShutdownHandler is dispatched. That class has both isCarryingRoot() and isCarryingMeta() defined.
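
          For readers following along, the override pattern being described might look roughly like this; an illustrative sketch with made-up class names, not the actual HBase source:

            // Illustrative sketch only, not HBase source code. The base handler
            // assumes the dead server carried neither catalog region; the
            // meta-aware handler is constructed knowing whether it did.
            class BaseShutdownHandlerSketch {
              boolean isCarryingRoot() { return false; }
              boolean isCarryingMeta() { return false; }
            }

            class MetaShutdownHandlerSketch extends BaseShutdownHandlerSketch {
              private final boolean carryingRoot;
              private final boolean carryingMeta;

              MetaShutdownHandlerSketch(boolean carryingRoot, boolean carryingMeta) {
                this.carryingRoot = carryingRoot;
                this.carryingMeta = carryingMeta;
              }

              @Override boolean isCarryingRoot() { return carryingRoot; }
              @Override boolean isCarryingMeta() { return carryingMeta; }
            }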

          I will publish another patch in a couple of days. Stack, please hold off on hacking this in.

          Thanks!
          Vishal

          dhruba borthakur added a comment -

          It will be nice to get both "forced" and "graceful" decommissioning into this jira, if Vishal agrees.

          stack added a comment -

          In short, maybe we need a "force" option in decommissioning, which if used, does not wait for a graceful shutdown of the specified regionserver, instead declares it dead immediately and then follows the normal course of action (lease recovery, reassign regions, etc)

          Agreed. Seems like this patch does the 'forced' option. Missing from this patch (though available as a script) is the graceful decommission. Vishal, were you going for the 'forced' option only? If so, should we commit this patch as is (though I think it dangerous if misunderstood and user is not clear that excludes means 'forced' decommission).

          dhruba borthakur added a comment -

          There are actually two use-cases that triggered this JIRA.

          1. There are times when the administrator wants to shut down a few region servers, usually to upgrade hardware or for some such reason. In this case, the administrator can put these machines in the excludes list and wait for those region servers to gracefully shut down.

          2. The other use-case is when certain region servers become unresponsive. Twice it so happened that a regionserver was heartbeating with ZK, but its capability to process hbase workload suddenly fell to almost zero. We could not ssh into the machine to debug what was wrong. (The suspicion is that the machine started swapping.) In this case, it would be nice if the administrator had an option to put a machine in the excludes list, wait a few minutes for it to gracefully exit, and if it still does not exit, then forcefully declare the regionserver as "dead".

          In short, maybe we need a "force" option in decommissioning, which if used, does not wait for a graceful shutdown of the specified regionserver, instead declares it dead immediately and then follows the normal course of action (lease recovery, reassign regions, etc)

          Ted Yu added a comment -

          Looking through ServerShutdownHandler, I found this code in process():

              // Assign root and meta if we were carrying them.
              if (isCarryingRoot()) { // -ROOT-
          

          But:

            boolean isCarryingRoot() {
              return false;
            }
          

          Same with

              if (isCarryingMeta()) this.services.getAssignmentManager().assignMeta();
          
          stack added a comment -

          2. enqueues a shutdownhandler to move all the regions off this node

          ServerShutdownHandler splits logs and reassigns regions that were (or are, in the decommissioning case) up on the shutdown server. It does not unload them off the hosting regionserver. So, it looks to me as though regions can be in two places at once... still on the server that has been marked decommissioned and up in the new locations that ServerShutdownHandler has assigned them to. This looks a little dangerous (if I am reading this right).

          If decommissioning, it looks like you expire the server only at the end of the server shutdown processing (your call to markServerExpiry will be noticed by the hosting regionserver and it'll shut itself down closing out the regions it was hosting).

          New JIRA for Karthik's suggestion sounds good.

          Vishal Kathuria added a comment -

          Hi stack,

          Here is some more info on the design. ServerManager::decommissionServer(serverInfo):
          1. first puts the node in the decommissioned list so no new regions get placed on this node. The decommissioned list prevents a node with that name from joining but doesn't kick the node out if it has already joined.
          2. enqueues a shutdownhandler to move all the regions off this node
          3. once the shutdown handler is done doing that, then it marks the node dead. It is only at this point that the region server gets a YouAreDeadException and shuts itself down.

          So this is supposed to do gentle offload of regions. That's why I had to put a wait in my test because the gentle offload can take some time.
          I did check in the debug logs that the regions are offloaded before the server gets shut down - so it's working for me. Please go ahead and hack this in.
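
          Sketched as code, that flow looks roughly like the following; everything except the names already quoted in this thread (decommissionServer, markServerExpiry, YouAreDeadException) is a hypothetical stand-in, not Vishal's patch:

            import java.util.HashSet;
            import java.util.Set;

            // Rough sketch of the decommission flow described above.
            class DecommissionFlowSketch {
              private final Set<String> decommissionedNodes = new HashSet<String>();

              void decommissionServer(final String hostnamePort) {
                // 1. Record the node so it cannot rejoin and gets no new regions.
                decommissionedNodes.add(hostnamePort);
                // 2/3. Enqueue shutdown-handler work that reassigns the node's
                // regions and, only once that is done, expires the server; the
                // regionserver then receives YouAreDeadException and shuts down.
                enqueueShutdownHandler(new Runnable() {
                  public void run() {
                    reassignRegionsOf(hostnamePort);
                    markServerExpiry(hostnamePort);
                  }
                });
              }

              // Hypothetical stand-ins for the master-side machinery.
              private void enqueueShutdownHandler(Runnable task) { task.run(); }
              private void reassignRegionsOf(String hostnamePort) { }
              private void markServerExpiry(String hostnamePort) { }
            }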

          Re Karthik's suggestion of a timeout, I'll open a separate task for it. Off the top of my head, I can't think of a simple way of implementing it.

          Thanks a lot for reviewing and committing this change!
          Vishal

          stack added a comment -

          Regarding destination for the patch - I'll follow up with dhruba to figure out the desired destination and make appropriate changes to remove the deprecated stuff.

          NP. Looks like this patch is for branch. Thats fine. I can hack it into TRUNK for you if needs be.

          ...ServerManager::refreshNodes() calls ServerManager::decommissionServer(serverInfo)...

          I see. So we just crash the server down and let the shutdown process do the reassign of regions. I was looking for gentle offload of regions. Thats not here. Thats fine. I just thought it was.

          Let me know if you are going to update patch. If you are, I'll wait, else will try hack this in (if its working for you!)

          Vishal Kathuria added a comment -

          Thanks for the review and the suggestions stack.

          I didn't know about the JUnit test timeout - you are right, I don't need to limit the loop.

          Regarding destination for the patch - I'll follow up with dhruba to figure out the desired destination and make appropriate changes to remove the deprecated stuff.

          "I'm not clear on how a decommissioned node, once added to the excludes, is cleared of its regions. Did I miss it?"
          When the node list is refreshed and a node shows up in the decommissioned list, ServerManager::refreshNodes() calls ServerManager::decommissionServer(serverInfo) to start the decommissioning process.

          Thanks!
          Vishal

          stack added a comment -

          On timeout and then kill, if Karthik suggested it, then it must be a good idea (smile).

          No biggie but FYI and for next time, style is to put spaces around operators; not "(int i=0;i<3;i++)" but "(int i = 0; i < 3; i++)"

          This is bad, no?

          +    try {
          +      BufferedWriter out = new BufferedWriter(new FileWriter(exFilePath));
          +      out.write(info.getHostnamePort());
          +      out.close();
          +    } catch (IOException e) {
          +    }
          

          If you fail to write the excludes file, the test will fail anyways? Just let out the IOE?
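
          For instance, the write could let the exception escape so JUnit reports the failure directly (a sketch of the suggestion, not the committed test):

            import java.io.BufferedWriter;
            import java.io.FileWriter;
            import java.io.IOException;

            class ExcludesFileWriteSketch {
              // Let the IOException propagate instead of swallowing it; if the
              // excludes file cannot be written, the test fails right here.
              static void writeExcludesFile(String exFilePath, String hostnamePort)
                  throws IOException {
                BufferedWriter out = new BufferedWriter(new FileWriter(exFilePath));
                try {
                  out.write(hostnamePort);
                } finally {
                  out.close();
                }
              }
            }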

          Here:

          +    for (int i=0;i<3;i++) {
          +    	Thread.sleep(2000);
          +    	if (!cluster.getMaster().getServerManager().getOnlineServersList().contains(info));
          +    	break;
          +    }
          

          You already have a three minute timeout on the test so this wait of six seconds is probably not needed... just loop till it happens OR until junit intercedes and kills the thing because its going on too long.
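
          In other words, roughly the following, using the cluster and info variables from the quoted test (a sketch, not the committed code):

            // Poll until the master drops the server from its online list, and let
            // the JUnit-level timeout abort the test if that never happens.
            while (cluster.getMaster().getServerManager().getOnlineServersList().contains(info)) {
              Thread.sleep(200);
            }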

          Again, none of the above are that important... more for the next time.

          Just to note that NodeDecommissionedException is very close to the YouAreDeadException. Even so, I think it a good idea to add it.

          This patch is for TRUNK? HServerInfo is deprecated. Do you want to use that?

          Nice. You added refreshNodes to the shell.

          I'm not clear on how a decommissioned node, once added to the excludes, is cleared of its regions. Did I miss it?

          Thanks Vishal.

          Vishal Kathuria added a comment -

          Sorry for the radio silence on this issue folks. I have updated the patch with the comments. The biggest change is that now we wait for the region servers to be reassigned before adding the node to expire server list (which will trigger node shutdown).

          Please let me know if you have any further comments.

          Also, let me know if we need to have some sort of timeout and if the region servers aren't migrated within that timeout, then we kill the region server anyway. I got this suggestion from Karthik, but wanted to see what other folks think before implementing it.

          Vishal Kathuria made changes -
          Attachment excl-patch.txt [ 12479496 ]
          stack added a comment -

          It was done in ruby because I wanted a graceful_restart that worked with the shipping version of hbase; I did not want to require users to update to get a working decommission. The ruby script will be discarded when hbase can do what it does natively. It's a stop gap.

          Don't call out to the ruby script (smile). On the other hand, it has had some exercise and has had some experience frozen into it so is worth a little bit of study if reimplementing up in the master.

          +1 on your 1. and 2. I would encourage you to add bits of general functionality that can be exploited by master or others in future silly ruby scripts.

          Good on you Vishal

          Vishal Kathuria added a comment -

          stack: A quick question. I looked at region_mover.rb and I was wondering why this was implemented in ruby as a script and not in the master. Since I do the parsing of the decom file in the master, I would need to call out from master to the ruby script, which doesn't sound like such a good idea. Secondly, if we improve the load balancing policy from just random to something like load based, the ruby script would not be able to take advantage of it since random is hard coded in it.

          Here is what I think we should ideally do:
          1. Refactor the code in AssignmentManager/ServerShutdown handler to expose a function that takes all the regions off a node.
          2. Use the above function for decommissioning, failure handling and the unload admin commands so we get similar behavior and single implementation across the three.

          What do you guys think?

          dhruba:
          Agreed: I will leave the configs as is for this jira. I assume the config files need to live on some NFS share so they are available when the master fails over - is that so?

          Thanks

          dhruba borthakur added a comment -

          I meant "let's not move any of these configs to zk".

          dhruba borthakur added a comment -

          agree with stack. let's nove any of these configs to zk (as part of this jira).

          stack added a comment -

          2. I would like to hear from folks regarding what they think of storing these files in zk. Heck, why shouldn't we store the entire config in zk and the only config you need is the address of the zk cluster?

          Yeah, long time plan is to put all config. up in zk. Would suggest though that that is out of scope for this issue and that if your aim is to have this feature work like hadoop's, you not move the includes/excludes there, at least, not as part of this issue.

          Otherwise all else is good. Good on you Vishal.

          Vishal Kathuria added a comment -

          Thank you for the comments folks. I wasn't aware of HDFS-1547 - I just looked through it.

          Here are my thoughts
          1. Agree with doing a graceful shutdown of the region server that is being excluded.
          2. I would like to hear from folks regarding what they think of storing these files in zk. Heck, why shouldn't we store the entire config in zk and the only config you need is the address of the zk cluster?
          3. Regarding the name, I will change the file to decom instead of exclude for consistency with HDFS. One difference with 1547 is that 1547 leaves those nodes up and I am shutting them down. I don't see the value of keeping those nodes up so I will stay with that unless you folks think otherwise.
          4. Re Todd's suggestion on the truth table, is it okay if I publish the boolean expression in the documentation? I have an easier time inferring the intuition from a boolean expression than from a truth table.
          5. stack: I will update the licenses to 2011 and look at the spaces again. My intent was to use 2 but I may have missed some places.
          6. The change in .sh file is intentional, otherwise the script for starting the local region server was not working. I need to make sure the change doesn't break any of the other scripts. I will verify that.

          Thanks again folks.
          Vishal

          stack added a comment -

          There is a 'Configuration' section in the book: http://hbase.apache.org/book/configuration.html Vishal, you can add doc. there or just dump a paragraph or two in here and on commit we'll port it over. Thanks.

          dhruba borthakur added a comment -

          Thanks Todd. I agree that Vishal can help in filling up the table. Is there a hbase admin guide that needs to be updated too?

          Todd Lipcon added a comment -

          I'm OK with the separate files approach as long as we have clear documentation about what the different combinations mean. In the table above there are a few obvious cells and a lot that are unclear.

          dhruba borthakur added a comment -

          Patch looks good at first sight. I would like the regions to close elegantly, so that log replay is not required when those regions are instantiated on other regionservers. If you can do that enhancement, that will be great.

          I am fine with the includes/excludes/decommission list (instead of Todd's regex solution), especially because HBase administrators are usually the same as the hdfs administrators; unless Todd feels strongly about it.

          stack added a comment -

          Vishal: What Todd says. Here's some comments on the patch itself. Licenses should be 2011 FYI. Tabs are two spaces in hbase/hadoop (You seem to have more). It looks like you are crashing out the regionserver and on master running the expireServer. That means that the ShutdownServer handler runs. It'll take care of assigning out the regions that were on the crashed server but because this is a crash-out, we'll also be playing the regionserver's logs. Do you want to add facility for having a regionserver shed its regions and then close clean so there are no regionserver WAL logs to replay? (Its ok if you don't, just suggesting).

          This is intentional?

          -        $command $startStop "$@" > "$logout" 2>&1 < /dev/null &
          +        $command "$@" $startStop > "$logout" 2>&1 < /dev/null &
          

          Otherwise patch looks good Vishal.

          Todd Lipcon added a comment -

          I find these include/exclude files just as confusing here in HBase as they've always been in Hadoop. Some users are less polite and call them "insane"

          Could you please do a favor and fill in the following table?

          Include present | Exclude present | In include | In exclude | Result at startup | Result after refresh
          o               | o               | o          | o          | ?                 | ?
          o               | x               | o          | o          | ?                 | ?
          o               | x               | o          | x          | ?                 | ?
          x               | o               | o          | o          | ?                 | ?
          x               | o               | x          | o          | ?                 | ?
          x               | x               | o          | o          | ?                 | ?
          x               | x               | o          | x          | ?                 | ?
          x               | x               | x          | o          | ?                 | ?
          x               | x               | x          | x          | ?                 | ?

          Rather than this very complicated structure, how about a simpler format, something like:

          host.rules:

          <regex> (allow|deny)
          ...
          

          where we ship a default config that looks like:

          .* allow
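
          For illustration, such a rules file could be evaluated first-match-wins; a sketch of the proposed scheme, not anything in the attached patch:

            import java.util.LinkedHashMap;
            import java.util.Map;
            import java.util.regex.Pattern;

            // Sketch of the proposed host.rules evaluation: ordered
            // "<regex> (allow|deny)" entries, first matching rule wins.
            class HostRulesSketch {
              private final Map<Pattern, Boolean> rules = new LinkedHashMap<Pattern, Boolean>();

              void addRule(String regex, boolean allow) {
                rules.put(Pattern.compile(regex), allow);
              }

              boolean isAllowed(String host) {
                for (Map.Entry<Pattern, Boolean> rule : rules.entrySet()) {
                  if (rule.getKey().matcher(host).matches()) {
                    return rule.getValue();
                  }
                }
                return false; // no rule matched: deny by default
              }
            }

          With the shipped default of ".* allow", every host matches the first rule and is admitted.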
          
          Vishal Kathuria made changes -
          Status: Open [ 1 ] → Patch Available [ 10002 ]
          Release Note added (text as in the Release Note section above)
          Affects Version/s 0.90.2 [ 12316152 ]
          Vishal Kathuria added a comment -

          This change also includes the unit tests, which tests the key scenario of adding an online region server to the excludes file and then refreshing the excludes file.

          Vishal Kathuria made changes -
          Attachment excl-patch.txt [ 12478028 ]
          Vishal Kathuria added a comment -

          Patch for the fix.

          dhruba borthakur added a comment -

          +1 to Todd's idea of similarity with HDFS-1547

          Todd Lipcon added a comment -

          If going for "same semantics as Hadoop", we should try to match the semantics of HDFS-1547 which adds a new "decom" file, rather than include/exclude. Most ops people I've talked to find the include/exclude files super confusing (what does it mean to be both included and excluded, for example?)

          Jean-Daniel Cryans added a comment -

          @Vishal, currently doing stop regionserver is probably not what you want to do in a cluster serving live requests (it closes all connections then closes the regions, meaning regions unavailable for a while). You'd want something more like bin/graceful_stop.sh

          @Dhruba, I think it would be cool to have that in ZK, but I also like having the same semantics as Hadoop. So I'm 0.

          dhruba borthakur added a comment -

          another option would be to maintain the includes-list and excludes-list in zk itself. Instead of local files on the master, they could be zk nodes that contain a list of machine-names. what do people think about that?

          Vishal Kathuria added a comment -

          Hi,
          I have started implementing this and below is a mini-spec. Could you please look through it and let me know if you have any comments.

          Functionality
          1. The semantics of the include and exclude lists would be identical to Hadoop: if the include list is empty, a universal include set is assumed, and the exclude list takes precedence over the include list (see the sketch after this list).
          2. Adding a node to exclude list + issuing a refresh using the command line utility will trigger ServerManager::expireServer for an already online server. This is the same code path that happens when the node goes down.
          3. Nodes in excludeList (or the nodes which are not in include list) will not be allowed to join.
          4. The list will support hostname:port type of entries, where port is optional. Even though in production we wouldn't have multiple regionservers per node, I am doing this to make testing easier on dev servers.
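
          A minimal sketch of the admission check implied by item 1 (a hypothetical helper, not code from the patch):

            import java.util.Set;

            class HostListCheckSketch {
              // An empty include list means "allow all"; the exclude list always
              // takes precedence over the include list.
              static boolean mayJoin(String hostnamePort, Set<String> includes, Set<String> excludes) {
                if (excludes.contains(hostnamePort)) {
                  return false;
                }
                return includes.isEmpty() || includes.contains(hostnamePort);
              }
            }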

          Tools
          For the command line utility, there will be a new command to refresh server list that will be implemented in HBaseAdmin.
          HBaseAdmin will call into a method that will be added to HMaster and HMasterInterface. As mentioned above, if a node is added to excludeList or is removed from include list, it will be expired. If a node is added to the includeList, no action will be taken.

          The command to refresh server list will be exposed in the hbase shell as well.

          Nicolas: 'bin/hbase-daemon.sh stop regionserver' kills the region server process. I am proposing that we don't try to do that. In the scenario that Dhruba mentioned to me, the regionserver node might be sick and we may not be able to kill it.

          Dhruba: Other than the command line utility to refresh the server list, are there any other tools that we need to provide?

          Thanks!

          Vishal Kathuria added a comment -

          Copying comment history from facebook internal task:

          Dhruba Borthakur - at 3:09pm on April 21st
          andrew/matthew: do we need a discussion on how this feature needs to be designed?

          1. There would be two files includes and excludes.
          2. If a machine is listed in excludes file, it shuts down the regionserver on that machine.
          3. Region servers should start only on machines that are listed in the includes file.
          4. a command line utility to modify/update the excludes/includes file.

          Andrew Ryan - at 3:15pm on April 21st
          We've wanted parity from the beginning. So yes, online decommissioning should be supported. I don't see any reason to do otherwise.

          To Vishal's question, we wouldn't allow the node to join and migrate the regions. We just wouldn't allow it to join. Just shut the regionserver down. But, if a regionserver has active regions when it gets the decommission request, then it should gracefully reassign its regions and then shut down.

          Nicolas Spiegelberg - at 3:20pm on April 21st
          I think we just want to deny requests and force a regionserver already online to close. HBase just has a logical association with regions that is given upon onlining, stored in an HDFS file called META. We save the state across a graceful restart, but will assign out those regions after a small startup period has passed without onlining. Therefore, we don't benefit from trying to grab any state from the physical server itself.

          Vishal Kathuria - at 3:28pm on April 21st
          Thanks guys.

          One question is:
          if I move a server from includes to excludes, should that trigger online decommissioning and shut the machine down when decommissioning is done? Or should we kick it out right away?

          I think the former is better. Looks like we are all agreeing on that?

          thanks


          Vishal Kathuria - at 3:35pm on April 21st
          Sorry, I wrote my comment before I read Nicolas'.

          I see his point - since the Region Servers don't have persisted data (which is in HDFS), we don't lose much by kicking them out.

          sounds reasonable to me.

          Dhruba Borthakur - at 4:37pm on April 21st
          @Nicolas: "but will assign out those regions after a small startup period has passed without onlining" – is it possible to avoid the small startup period mentioned above if the RS is decommissioned?

          Nicolas Spiegelberg - at 5:43pm on April 21st
          @Dhruba: agreed. We should just see what 'bin/hbase-daemon.sh stop regionserver' does. By my comment, I just meant to say that, if need be, we can actually just kick out the connection and do all the work on the master. As a safeguard either way, we should rename that RS's log directory (if it exists). This will forcibly cause the RS to die if something went awry.

          dhruba borthakur created issue -

            People

            • Assignee:
              dhruba borthakur
              Reporter:
              dhruba borthakur
             • Votes:
               1
               Watchers:
               7
