  1. HBase
  2. HBASE-13605

RegionStates should not keep its list of dead servers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Region Assignment
    • Labels:
      None

      Description

      As mentioned in https://issues.apache.org/jira/browse/HBASE-9514?focusedCommentId=13769761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13769761 and HBASE-12844 we should have only 1 source of cluster membership.

      The list of dead servers, together with RegionStates doing its own liveness check (ServerManager.isServerReachable()), has again caused an assignment problem in a test cluster: RegionStates "thinks" that the server is dead and that SSH will handle the region assignment. However, the RS is not dead at all. It is living happily and never gets a ZK expiry, a YouAreDeadException, or anything else. This leaves its regions unassigned in the OFFLINE state.
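To make the disagreement concrete, here is a minimal sketch of two membership views drifting apart. Every class and method name below is hypothetical and heavily simplified. None of it is the actual HBase code, and the assumption that a single failed probe is enough to mark a server dead locally is made for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model; these are not the real HBase classes. */
class MembershipSketch {
    /** Stand-in for the authoritative cluster membership view. */
    static class ClusterMembership {
        private final Set<String> online = new HashSet<>();
        void serverJoined(String server) { online.add(server); }
        void serverExpired(String server) { online.remove(server); }
        boolean isOnline(String server) { return online.contains(server); }
    }

    /** Stand-in for a component that keeps its own private dead-server list. */
    static class PrivateDeadList {
        private final Set<String> dead = new HashSet<>();
        // Assumption: one failed reachability probe marks the server dead locally.
        void probeFailed(String server) { dead.add(server); }
        boolean isDead(String server) { return dead.contains(server); }
    }

    /** True when the views disagree: locally "dead" but still a cluster member. */
    static boolean viewsDisagree(ClusterMembership m, PrivateDeadList d, String server) {
        return m.isOnline(server) && d.isDead(server);
    }

    /** With one source of truth, "dead" is derived instead of stored separately. */
    static boolean isDeadSingleSource(ClusterMembership m, String server) {
        return !m.isOnline(server);
    }
}
```

In this model a transient probe failure is enough to make `viewsDisagree` return true, while `isDeadSingleSource` cannot contradict the membership view because it has no state of its own.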

      master assigning the region:

      15-04-20 09:02:25,780 DEBUG [AM.ZK.Worker-pool3-t330] master.RegionStates: Onlined 77dddcd50c22e56bfff133c0e1f9165b on os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 {ENCODED => 77dddcd50c
      

      Master then disabled the table, and unassigned the region:

      2015-04-20 09:02:27,158 WARN  [ProcedureExecutorThread-1] zookeeper.ZKTableStateManager: Moving table loadtest_d1 state from DISABLING to DISABLING
       Starting unassign of loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b. (offlining), current state: {77dddcd50c22e56bfff133c0e1f9165b state=OPEN, ts=1429520545780,   server=os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268}
      bleProcedure$BulkDisabler-0] master.AssignmentManager: Sent CLOSE to os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 for region loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b.
      2015-04-20 09:02:27,414 INFO  [AM.ZK.Worker-pool3-t316] master.RegionStates: Offlined 77dddcd50c22e56bfff133c0e1f9165b from os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
      

      On table re-enable, AM does not assign the region:

      2015-04-20 09:02:30,415 INFO  [ProcedureExecutorThread-3] balancer.BaseLoadBalancer: Reassigned 25 regions. 25 retained the pre-restart assignment.
      2015-04-20 09:02:30,415 INFO  [ProcedureExecutorThread-3] procedure.EnableTableProcedure: Bulk assigning 25 region(s) across 5 server(s), retainAssignment=true
      
      l,16000,1429515659726-GeneralBulkAssigner-4] master.RegionStates: Couldn't reach online server os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
      
      l,16000,1429515659726-GeneralBulkAssigner-4] master.AssignmentManager: Updating the state to OFFLINE to allow to be reassigned by SSH
      nmentManager: Skip assigning loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b., it is on a dead but not processed yet server: os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
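The sequence in the last three log lines can be modeled roughly as below. This is a sketch under stated assumptions, not HBase's implementation: the class, its fields, and both methods are hypothetical names, and shutdown handling is reduced to "runs only for servers that actually expired".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the enable-table path seen in the logs; not HBase code. */
class StuckRegionSketch {
    enum State { OPEN, OFFLINE }

    final Map<String, State> regionState = new HashMap<>();
    final Set<String> locallyDead = new HashSet<>();    // private dead-server list
    final Set<String> expiredServers = new HashSet<>(); // servers that really expired

    /** Bulk-assign step: a failed probe marks the server dead and defers the region. */
    void tryAssign(String region, String server, boolean probeSucceeded) {
        if (!probeSucceeded) {
            locallyDead.add(server);
        }
        if (locallyDead.contains(server)) {
            // Skipped: left OFFLINE for shutdown handling to pick up later.
            regionState.put(region, State.OFFLINE);
            return;
        }
        regionState.put(region, State.OPEN);
    }

    /** Shutdown handling reassigns the region only if the server actually expired. */
    void runShutdownHandling(String region, String server) {
        if (expiredServers.contains(server)) {
            regionState.put(region, State.OPEN); // reassigned elsewhere
        }
    }
}
```

If the server never expires, `runShutdownHandling` has nothing to do and the region stays OFFLINE indefinitely, which matches the stuck state described in this issue.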
      

        Attachments

        1. hbase-13605_v4-master.patch
          10 kB
          Enis Soztutar
        2. hbase-13605_v4-branch-1.1.patch
          14 kB
          Enis Soztutar
        3. hbase-13605_v3-branch-1.1.patch
          14 kB
          Enis Soztutar
        4. hbase-13605_v1.patch
          8 kB
          Enis Soztutar

            People

            • Assignee:
              enis Enis Soztutar
              Reporter:
              enis Enis Soztutar
            • Votes:
              2
              Watchers:
              13