HBASE-5155

ServerShutdownHandler and Disable/Delete should not run in parallel, as this leads to re-creation of regions that were deleted

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.90.4
    • Fix Version/s: None
    • Component/s: master
    • Labels: None
    • Release Note:
      This issue is an incompatible change.
      If an HBase client with the HBASE-5155 changes is used against a server (master) without them, is_enabled (from the HBase shell) or isTableEnabled() (from HBaseAdmin) returns false even though the master considers the table enabled.

      With the same combination (the client has the HBASE-5155 changes and the server does not), an attempt to enable a table makes the client hang.

      The reason:
      Before HBASE-5155, once a table is enabled, the znode created for the table in ZooKeeper is deleted.
      After HBASE-5155, the znode is not deleted once the table is enabled; instead, the same znode is updated with the status ENABLED.

      The client also expects the table's znode in ZooKeeper to be in the ENABLED state if the table has been enabled successfully.
      These changes make the client behaviour incompatible if the client does not have this fix while the server does.
      If neither the client nor the server has this fix, the behaviour is as expected.
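
The mismatch between a patched client and an unpatched master can be illustrated with a small model. This is only a sketch: the class `TableZnodeCompat`, the `State` enum and both methods are hypothetical stand-ins, not HBase APIs, and the znode is modelled as an `Optional` that is empty when the znode does not exist.

```java
import java.util.Optional;

public class TableZnodeCompat {
    // Hypothetical model of the table znode content; an empty Optional means "no znode".
    enum State { ENABLED, DISABLED }

    // Assumed master behaviour once a table has been enabled:
    // without the fix the znode is deleted, with the fix it is kept and set to ENABLED.
    static Optional<State> znodeAfterEnable(boolean masterHasFix) {
        return masterHasFix ? Optional.of(State.ENABLED) : Optional.empty();
    }

    // Assumed check in a client that carries the fix: it expects an explicit ENABLED state.
    static boolean patchedClientSeesEnabled(Optional<State> znode) {
        return znode.isPresent() && znode.get() == State.ENABLED;
    }

    public static void main(String[] args) {
        // Patched client against an unpatched master: the znode is gone, so the check fails.
        System.out.println("unpatched master: "
            + patchedClientSeesEnabled(znodeAfterEnable(false)));
        // Patched client against a patched master: the znode says ENABLED.
        System.out.println("patched master: "
            + patchedClientSeesEnabled(znodeAfterEnable(true)));
    }
}
```

In this model, a patched client that waits for the ENABLED state after requesting an enable would never see it on an unpatched master, which is consistent with the hang described above.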

    Description

      ServerShutdownHandler and the disable/delete table handlers race. This issue is not caused by TM.
      -> A regionserver goes down. In our cluster the regionserver holds a lot of regions.
      -> A region R1 has two daughters, D1 and D2.
      -> The ServerShutdownHandler gets called, scans META and gets all the user regions.
      -> In parallel, a table is disabled. (No problem in this step.)
      -> Delete table is done.
      -> The table and its regions are deleted, including R1, D1 and D2. (So META is cleaned.)
      -> Now ServerShutdownHandler starts to process the dead regions (processDeadRegion):

       if (hri.isOffline() && hri.isSplit()) {
            LOG.debug("Offlined and split region " + hri.getRegionNameAsString() +
              "; checking daughter presence");
            fixupDaughters(result, assignmentManager, catalogTracker);
      

      As part of fixupDaughters, since the daughters D1 and D2 are missing for R1,

          if (isDaughterMissing(catalogTracker, daughter)) {
            LOG.info("Fixup; missing daughter " + daughter.getRegionNameAsString());
            MetaEditor.addDaughter(catalogTracker, daughter, null);
      
            // TODO: Log WARN if the regiondir does not exist in the fs.  If its not
            // there then something wonky about the split -- things will keep going
            // but could be missing references to parent region.
      
            // And assign it.
            assignmentManager.assign(daughter, true);
      

      we call assign on the daughters.
      After this, we continue with the code below.

              if (processDeadRegion(e.getKey(), e.getValue(),
                  this.services.getAssignmentManager(),
                  this.server.getCatalogTracker())) {
                this.services.getAssignmentManager().assign(e.getKey(), true);
      

      When the SSH scanned META, it contained R1, D1 and D2.
      So, as part of the above code, D1 and D2, which were already assigned by fixupDaughters,
      are assigned again by

      this.services.getAssignmentManager().assign(e.getKey(), true);
      

      This leads to a ZooKeeper failure due to a bad version, which kills the master.
      More importantly, the regions that were deleted are recreated, which I think is more critical.
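
The sequence above can be replayed deterministically with a small in-memory model. This is a sketch under stated assumptions, not HBase code: `StaleMetaScanRace`, its `meta` set and its `assigned` list are hypothetical stand-ins for META and the AssignmentManager, and the split-parent check is reduced to a name comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StaleMetaScanRace {
    // Hypothetical stand-in for META: the region names it currently lists.
    static final Set<String> meta = new LinkedHashSet<>();
    // Records every assign call so that duplicate assignments are visible.
    static final List<String> assigned = new ArrayList<>();

    static void assign(String region) {
        assigned.add(region);
    }

    // Models the fix-up step: a daughter missing from META is added back and assigned.
    static void fixupDaughters(String... daughters) {
        for (String d : daughters) {
            if (!meta.contains(d)) {
                meta.add(d);
                assign(d);
            }
        }
    }

    static void run() {
        meta.clear();
        assigned.clear();
        meta.addAll(Arrays.asList("R1", "D1", "D2"));

        // Step 1: the shutdown handler scans META and keeps its own snapshot.
        List<String> snapshot = new ArrayList<>(meta);

        // Step 2: in parallel, the table is disabled and deleted, so META is cleaned.
        meta.clear();

        // Step 3: the shutdown handler processes its stale snapshot.
        for (String region : snapshot) {
            if (region.equals("R1")) {
                // R1 is the offline, split parent: its daughters look missing.
                fixupDaughters("D1", "D2");
            } else {
                // D1 and D2 are still in the snapshot, so they are assigned again.
                assign(region);
            }
        }
    }

    public static void main(String[] args) {
        run();
        System.out.println("META after the race: " + meta);
        System.out.println("assign calls: " + assigned);
    }
}
```

In this model the deleted daughters reappear in `meta`, and each of them is assigned twice: once by the fix-up step and once by the loop over the stale snapshot.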

      Attachments

        1. HBASE-5155_latest.patch
          19 kB
          ramkrishna.s.vasudevan
        2. hbase-5155_6.patch
          25 kB
          ramkrishna.s.vasudevan
        3. HBASE-5155_1.patch
          25 kB
          ramkrishna.s.vasudevan
        4. HBASE-5155_2.patch
          25 kB
          ramkrishna.s.vasudevan
        5. HBASE-5155_3.patch
          25 kB
          ramkrishna.s.vasudevan

          People

            ram_krish ramkrishna.s.vasudevan