HBase / HBASE-3809

.META. may not come back online if more servers crash than there are executors and one of the crashed servers was carrying .META.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This is a duplicate of another issue but at the moment I cannot find the original.

      If you had a 700-node cluster and then ran something on it that killed 100 nodes, and .META. had been running on one of those downed nodes, well, all of your master executors would be processing ServerShutdowns, and more than likely none of the currently running executors would be servicing the shutdown of the server that was carrying .META.

      Well, for server shutdown to complete at the moment, an online .META. is required. So, in the above case, we'll be stuck. The current executors will not be able to clear to make space for the processing of the server carrying .META. because they need .META. to complete.

      We can make the master handlers unbounded so the pool expands to accommodate all crashed servers (so the handler for the one carrying .META. always gets a thread instead of waiting in the queue), or we can change shutdown handling so it doesn't require .META. to be online (it's used to figure out which regions the server was carrying); we could use the master's in-memory picture of the cluster instead (but IIRC, there may be holes ... TBD).
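The stuck state described above is the usual bounded-pool starvation pattern, and it can be reproduced outside HBase. The following is a minimal sketch using only JDK classes; `ExecutorStarvationSketch`, `recover`, and the latch standing in for ".META. is online" are hypothetical names for illustration, not HBase code. It assumes the ordinary shutdown handlers are queued ahead of the handler for the server that carried .META.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names are HBase classes.
class ExecutorStarvationSketch {

  /**
   * Queues `crashed` ordinary shutdown tasks that each wait for a
   * "meta online" latch, then queues the one task that opens the latch.
   * Returns true if all ordinary tasks finished before the timeout.
   */
  static boolean recover(int poolSize, int crashed) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    CountDownLatch metaOnline = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(crashed);
    try {
      for (int i = 0; i < crashed; i++) {
        pool.execute(() -> {
          try {
            metaOnline.await();   // ordinary shutdown needs .META. to complete
            done.countDown();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // The handler for the server that carried .META. is queued last.
      pool.execute(metaOnline::countDown);
      return done.await(500, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      pool.shutdownNow();         // interrupts any tasks still waiting
    }
  }

  public static void main(String[] args) {
    // Fewer threads than crashed servers: every thread waits on the latch.
    System.out.println("pool=2, crashed=3: " + recover(2, 3));
    // Enough threads for every crashed server plus the meta handler.
    System.out.println("pool=4, crashed=3: " + recover(4, 3));
  }
}
```

In this sketch the pool only makes progress when it has at least one more thread than there are blocked handlers, which is why an unbounded pool sidesteps the problem.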

          Activity

          Jean-Daniel Cryans added a comment -

          In the same vein (having to rely on .META. for region server shutdown), we saw an issue yesterday where the balancer started just before a region server was cleanly shut down. In sequence:

          • Balancer starts unassigning regions
          • RS starts closing a few regions for balancing
          • RS is told to stop
          • Master initiates the region server shutdown handler which scans .META. for regions that are on that region server
          • Regions are being unassigned and moved while the master force unassigns regions that (it thinks) are on the RS
          • At the end, 25 out of 500 regions are double assigned because they were already reassigned when the server shutdown reassigns them.

          This happens because the master relies on potentially stale information when forcing the unassign. According to the comments in the code, we still have to scan to check against splits. The workaround is to disable the balancer before shutting down a region server (like rolling restart does).

          hbck fixed the double assignment without any trouble.
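To make the stale-information problem concrete, here is a small sketch of how a snapshot taken from a .META. scan can disagree with the live assignment state after the balancer has moved a region. Everything in it is hypothetical (`StaleAssignmentSketch`, the two maps, the region and server names); it is not how the master is implemented, and the guard shown is only one possible check, not a fix taken from the HBase code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; these are not HBase data structures.
class StaleAssignmentSketch {

  /** Regions the snapshot says were on the dead server (may be stale). */
  static List<String> naive(Map<String, String> metaSnapshot, String deadServer) {
    List<String> regions = new ArrayList<>();
    for (Map.Entry<String, String> e : metaSnapshot.entrySet()) {
      if (deadServer.equals(e.getValue())) {
        regions.add(e.getKey());
      }
    }
    return regions;
  }

  /** Same list, minus regions that the live view shows on another server. */
  static List<String> guarded(Map<String, String> metaSnapshot,
                              Map<String, String> liveAssignments,
                              String deadServer) {
    List<String> regions = new ArrayList<>();
    for (String region : naive(metaSnapshot, deadServer)) {
      String owner = liveAssignments.get(region);
      if (owner == null || deadServer.equals(owner)) {
        regions.add(region);
      }
    }
    return regions;
  }

  public static void main(String[] args) {
    Map<String, String> snapshot = new TreeMap<>();
    snapshot.put("r1", "rs1");
    snapshot.put("r2", "rs1");
    snapshot.put("r3", "rs2");

    // After the scan, the balancer moved r1 from rs1 to rs3.
    Map<String, String> live = new TreeMap<>(snapshot);
    live.put("r1", "rs3");

    System.out.println(naive(snapshot, "rs1"));          // [r1, r2]
    System.out.println(guarded(snapshot, live, "rs1"));  // [r2]
  }
}
```

Reassigning everything in the naive list would assign `r1` a second time, which matches the double assignment seen above.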

          Ted Yu added a comment -

          One potential issue is that it takes some time for a region server to shut down.
          If other region servers are overloaded during this period, the balancer wouldn't be able to help.

          The first step of region server shutdown would be to inform the load balancer not to unassign regions off the server.

          Lars Hofhansl added a comment -

          Moving out of 0.94.

          chunhui shen added a comment -

          I think it won't happen in trunk now, because:
          1. We use different ExecutorServices to execute ServerShutdownHandler and MetaServerShutdownHandler.
          2. In the process of MetaServerShutdownHandler:

          if (isCarryingRoot() || isCarryingMeta() // -ROOT- or .META.
              || !services.getAssignmentManager().isFailoverCleanupDone()) {
            this.services.getServerManager().processDeadServer(serverName);
            return;
          }
          

          This means MetaServerShutdownHandler can always be executed, so the stuck scenario won't happen again.
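The idea in point 1 can be sketched with plain JDK thread pools: route each handler type to its own pool so that a backlog of ordinary shutdown handlers cannot starve the meta one. This is only an illustration under assumptions. `TypedExecutorSketch`, `HandlerType`, and the pool sizes are hypothetical, and the sketch uses `java.util.concurrent.ExecutorService`, not HBase's own ExecutorService class.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the HBase executor implementation.
class TypedExecutorSketch {
  enum HandlerType { SERVER_SHUTDOWN, META_SERVER_SHUTDOWN }

  private final Map<HandlerType, ExecutorService> pools = new EnumMap<>(HandlerType.class);

  TypedExecutorSketch(int serverThreads, int metaThreads) {
    pools.put(HandlerType.SERVER_SHUTDOWN, Executors.newFixedThreadPool(serverThreads));
    pools.put(HandlerType.META_SERVER_SHUTDOWN, Executors.newFixedThreadPool(metaThreads));
  }

  /** Runs the handler on the pool dedicated to its type. */
  void submit(HandlerType type, Runnable handler) {
    pools.get(type).execute(handler);
  }

  void shutdownNow() {
    pools.values().forEach(ExecutorService::shutdownNow);
  }

  /** Returns true if all ordinary handlers complete despite a one-thread pool. */
  static boolean demo(int crashedServers) {
    TypedExecutorSketch executors = new TypedExecutorSketch(1, 1);
    CountDownLatch metaOnline = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(crashedServers);
    try {
      for (int i = 0; i < crashedServers; i++) {
        executors.submit(HandlerType.SERVER_SHUTDOWN, () -> {
          try {
            metaOnline.await();   // ordinary handlers still wait for .META.
            done.countDown();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // The meta handler has its own thread, so it is never stuck behind the queue.
      executors.submit(HandlerType.META_SERVER_SHUTDOWN, metaOnline::countDown);
      return done.await(2, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      executors.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("recovered: " + demo(5));
  }
}
```

Even with a single thread for ordinary handlers, the meta handler runs independently and releases them one after another.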

          Ted Yu added a comment -

          What about the scenario J-D described @ 22/Apr/11 21:48?

          chunhui shen added a comment -

          The scenario J-D described @ 22/Apr/11 21:48 is a common multi-assign case from early versions.

          We have done a lot of work to fix multi-assign cases,
          so it won't be a problem now.

          stack added a comment -

          Resolving as no longer valid based off Chunhui's argument above. Thanks Chunhui.

          Ted Yu added a comment -

          Here is a code snippet from ServerManager.expireServer():

                this.services.getExecutorService().submit(new MetaServerShutdownHandler(this.master,
                  this.services, this.deadservers, serverName, carryingRoot, carryingMeta));
              } else {
                this.services.getExecutorService().submit(new ServerShutdownHandler(this.master,
          
          stack added a comment -

          What is the point that you are trying to make @Ted Yu?

          Ted Yu added a comment -

          I was trying to find out which other ExecutorService is used to execute MetaServerShutdownHandler.
          In MasterServices, there is only one method returning ExecutorService:

            public ExecutorService getExecutorService();
          

          In HMaster, I only found one ExecutorService member variable:

            // Instance of the hbase executor service.
            ExecutorService executorService;
          
          stack added a comment -

          and... Ted Yu your point is?

          Ted Yu added a comment -

          If I read the code correctly, there is only one ExecutorService running both MetaServerShutdownHandler and ServerShutdownHandler.
          Point #1 from the comment @ 08/Jan/13 05:46 may not be true.

          Jimmy Xiang added a comment -

          @Ted, check ExecutorService#getExecutor(final ExecutorType type). Chunhui is right.

          stack added a comment -

          "Point #1 from the comment @ 08/Jan/13 05:46 may not be true."

          @Ted Yu There is no comment at the above-noted time.

          Ted Yu added a comment -

          @Jimmy:
          Thanks for the reminder.
          Reading into HMaster.startServiceThreads(), I found the answer.


            People

            • Assignee: chunhui shen
            • Reporter: stack
            • Votes: 0
            • Watchers: 9
