HBASE-5781

Zookeeper session got closed while trying to assign the region to RS using hbck -fix

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.7, 0.92.1, 0.94.0, 0.95.2
    • Fix Version/s: 0.92.2, 0.94.0, 0.95.0
    • Component/s: hbck
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      After running hbck on the cluster, one region was found to be unassigned,
      so hbck -fix was run to repair it.
      The assignment did not happen, however, because the ZooKeeper session had been closed.
      The trace below shows the details.
      -----------------------------------------
      Trying to fix unassigned region...
      12/04/03 11:02:57 INFO util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned:

      {NAME => 'ufdr,002300,1333379123498.00871fbd7583512e12c4eb38e900be8d.', STARTKEY => '002300', ENDKEY => '002311', ENCODED => 00871fbd7583512e12c4eb38e900be8d,}

      12/04/03 11:02:58 INFO client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x236738a2630000a
      12/04/03 11:02:58 INFO zookeeper.ZooKeeper: Session: 0x236738a2630000a closed
      ERROR: Region

      { meta => ufdr,010444,1333379123857.01594219211d0035b9586f98954462e1., hdfs => hdfs://10.18.40.25:9000/hbase/ufdr/01594219211d0035b9586f98954462e1, deployed => }

      not deployed on any region server.
      Trying to fix unassigned region...
      12/04/03 11:02:58 INFO zookeeper.ClientCnxn: EventThread shut down
      12/04/03 11:02:58 WARN zookeeper.ZKUtil: hconnection-0x236738a2630000a Unable to set watcher on znode (/hbase)
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
      at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:150)
      at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:263)
      at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.checkIfBaseNodeAvailable(ZooKeeperNodeTracker.java:208)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.checkIfBaseNodeAvailable(HConnectionManager.java:695)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:626)
      at org.apache.hadoop.hbase.client.HBaseAdmin.getMaster(HBaseAdmin.java:211)
      at org.apache.hadoop.hbase.client.HBaseAdmin.assign(HBaseAdmin.java:1325)
      at org.apache.hadoop.hbase.util.HBaseFsckRepair.forceOfflineInZK(HBaseFsckRepair.java:109)
      at org.apache.hadoop.hbase.util.HBaseFsckRepair.fixUnassigned(HBaseFsckRepair.java:92)
      at org.apache.hadoop.hbase.util.HBaseFsck.tryAssignmentRepair(HBaseFsck.java:1235)
      at org.apache.hadoop.hbase.util.HBaseFsck.checkRegionConsistency(HBaseFsck.java:1351)
      at org.apache.hadoop.hbase.util.HBaseFsck.checkAndFixConsistency(HBaseFsck.java:1114)
      at org.apache.hadoop.hbase.util.HBaseFsck.onlineConsistencyRepair(HBaseFsck.java:356)
      at org.apache.hadoop.hbase.util.HBaseFsck.onlineHbck(HBaseFsck.java:375)
      at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:2894)
      12/04/03 11:02:58 ERROR zookeeper.ZooKeeperWatcher: hconnection-0x236738a2630000a Received unexpected KeeperException, re-throwing exception
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
      at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:150)
      at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:263)
      at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.checkIfBaseNodeAvailable(ZooKeeperNodeTracker.java:208)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.checkIfBaseNodeAvailable(HConnectionManager.java:695)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:626)
      at org.apache.hadoop.hbase.client.HBaseAdmin.getMaster(HBaseAdmin.java:211)
      at org.apache.hadoop.hbase.client.HBaseAdmin.assign(HBaseAdmin.java:1325)
      at org.apache.hadoop.hbase.util.HBaseFsckRepair.forceOfflineInZK(HBaseFsckRepair.java:109)
      at org.apache.hadoop.hbase.util.HBaseFsckRepair.fixUnassigned(HBaseFsckRepair.java:92)
      at org.apache.hadoop.hbase.util.HBaseFsck.tryAssignmentRepair(HBaseFsck.java:1235)
      at org.apache.hadoop.hbase.util.HBaseFsck.checkRegionConsistency(HBaseFsck.java:1351)
      at org.apache.hadoop.hbase.util.HBaseFsck.checkAndFixConsistency(HBaseFsck.java:1114)
      at org.apache.hadoop.hbase.util.HBaseFsck.onlineConsistencyRepair(HBaseFsck.java:356)
      at org.apache.hadoop.hbase.util.HBaseFsck.onlineHbck(HBaseFsck.java:375)
      at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:2894)
      12/04/03 11:02:58 INFO client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
      12/04/03 11:02:58 INFO client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper
      12/04/03 11:02:58 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=10.18.40.21:2181,10.18.40.25:2181,10.18.40.93:2181 sessionTimeout=60000 watcher=hconnection
      12/04/03 11:02:58 INFO zookeeper.ClientCnxn: Opening socket connection to server /10.18.40.93:2181
      12/04/03 11:02:58 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 18333@HOST-10-18-40-93
      12/04/03 11:02:58 WARN client.ZooKeeperSaslClient: SecurityException: java.lang.SecurityException: Unable to locate a login configuration occurred when trying to find JAAS configuration.
      12/04/03 11:02:58 INFO client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration.
      12/04/03 11:02:58 INFO zookeeper.ClientCnxn: Socket connection established to HOST-10-18-40-93/10.18.40.93:2181, initiating session
      12/04/03 11:02:58 INFO zookeeper.ClientCnxn: Session establishment complete on server HOST-10-18-40-93/10.18.40.93:2181, sessionid = 0x3367392d5140018, negotiated timeout = 40000
      12/04/03 11:02:58 INFO client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
      Exception in thread "main" org.apache.hadoop.hbase.MasterNotRunningException
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:686)
      at org.apache.hadoop.hbase.client.HBaseAdmin.getMaster(HBaseAdmin.java:211)
      at org.apache.hadoop.hbase.client.HBaseAdmin.assign(HBaseAdmin.java:1325)
      at org.apache.hadoop.hbase.util.HBaseFsckRepair.forceOfflineInZK(HBaseFsckRepair.java:109)
      at org.apache.hadoop.hbase.util.HBaseFsckRepair.fixUnassigned(HBaseFsckRepair.java:92)
      at org.apache.hadoop.hbase.util.HBaseFsck.tryAssignmentRepair(HBaseFsck.java:1235)
      at org.apache.hadoop.hbase.util.HBaseFsck.checkRegionConsistency(HBaseFsck.java:1351)
      at org.apache.hadoop.hbase.util.HBaseFsck.checkAndFixConsistency(HBaseFsck.java:1114)
      at org.apache.hadoop.hbase.util.HBaseFsck.onlineConsistencyRepair(HBaseFsck.java:356)
      at org.apache.hadoop.hbase.util.HBaseFsck.onlineHbck(HBaseFsck.java:375)
      at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:2894)
      Please see the attached file for more details.

      1. hbase-5781.patch
        2 kB
        Jonathan Hsieh

          Activity

          Jonathan Hsieh added a comment -

          @Kristam Which versions are you using? (Can you fill out the Affects Version field?)

          I actually ran into this problem earlier today and have been spending some time investigating.

          Anoop Sam John added a comment -

          In HBaseFsckRepair.waitUntilAssigned

          finally {
            try {
              connection.close();
            } catch (IOException ioe) {
              throw ioe;
            }
          }
          

          From what I observed, this close caused the exception. The method is called from assignmentRepair, regionConsistencyRepair and metaRepair.

          If hbck -fix needs to repair all of these kinds of issues, or at least two of them, the close happens before the next fix runs. I have not checked this in detail; it is just an observation from the logs.

          Meanwhile, I tested the current RC for version 0.94.
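
The failure mode described in this comment can be illustrated with a toy model. This is a minimal sketch and not HBase code: `FakeConnection`, `repairAndClose` and `runRepairs` are hypothetical names standing in for the shared connection and the repair helpers. It assumes, as the observation above suggests, that every repair step reuses one connection and that the helper closes that connection in a finally block.

```java
import java.io.Closeable;
import java.io.IOException;

/** Toy model of the suspected failure mode; none of these are HBase classes. */
final class SharedConnectionDemo {

    /** Hypothetical stand-in for a connection shared by every repair step. */
    public static class FakeConnection implements Closeable {
        private boolean open = true;

        public void assign(String region) throws IOException {
            if (!open) {
                throw new IOException("session closed, cannot assign " + region);
            }
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** The problematic pattern: a helper closes a connection it does not own. */
    public static void repairAndClose(FakeConnection shared, String region) throws IOException {
        try {
            shared.assign(region);
        } finally {
            shared.close(); // also ends the session for every later repair
        }
    }

    /** Runs the repairs in order and returns how many succeeded before the first failure. */
    public static int runRepairs(FakeConnection shared, String... regions) {
        int fixed = 0;
        for (String region : regions) {
            try {
                repairAndClose(shared, region);
                fixed++;
            } catch (IOException e) {
                break; // the shared connection is gone, later repairs cannot proceed
            }
        }
        return fixed;
    }

    public static void main(String[] args) {
        int fixed = runRepairs(new FakeConnection(), "region-a", "region-b");
        System.out.println("repairs completed: " + fixed); // prints "repairs completed: 1"
    }
}
```

In this model only the first repair succeeds; the second one fails because the first call already closed the connection both of them depend on.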

          Jonathan Hsieh added a comment -

          I looked into this briefly. The version I've used on production systems doesn't have this finally/close portion in it. HBASE-3777 (on trunk/0.92/0.94) and HBASE-4508 (backport of HBASE-3777) on 0.90 added this.

          I think that if you remove the extra try-finally-close, -fix will work again (though it may leak resources). Can you give the modification a try?
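
As a rough illustration of the suggested modification, the sketch below drops the close from the helper and lets the top-level caller close the connection once, after all repairs have run. All names (`FakeConnection`, `repair`, `runRepairs`) are hypothetical, and this is not the attached patch: closing in the caller is an assumption added here to show one way the possible resource leak could be avoided.

```java
import java.io.Closeable;
import java.io.IOException;

/** Toy model of the suggested change; none of these are HBase classes. */
final class OwnedConnectionDemo {

    /** Hypothetical stand-in for a connection shared by every repair step. */
    public static class FakeConnection implements Closeable {
        private boolean open = true;

        public void assign(String region) throws IOException {
            if (!open) {
                throw new IOException("session closed, cannot assign " + region);
            }
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** The helper only borrows the connection and never closes it. */
    public static void repair(FakeConnection borrowed, String region) throws IOException {
        borrowed.assign(region);
    }

    /** The caller owns the connection and closes it once, after the last repair. */
    public static int runRepairs(String... regions) {
        int fixed = 0;
        try (FakeConnection conn = new FakeConnection()) {
            for (String region : regions) {
                repair(conn, region);
                fixed++;
            }
        } catch (IOException e) {
            // A failed repair stops the run; the connection is still closed exactly once.
        }
        return fixed;
    }

    public static void main(String[] args) {
        System.out.println("repairs completed: " + runRepairs("region-a", "region-b"));
        // prints "repairs completed: 2"
    }
}
```

Because the helper no longer closes anything, both repairs see an open connection, and the try-with-resources block in the caller releases it when the run ends.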

          Anoop Sam John added a comment -

          @Jon
          Yes, with this finally/close removed the test runs and hbck fixes the issues it finds.

          I will take a detailed look at the code on Monday so that we can close the issue. If you would rather supply a patch yourself, that is fine too.

          Jonathan Hsieh added a comment -

          @Anoop.

          If you want to supply a patch, that would be great. We would definitely want to get this into 0.94! Currently there is a hanging test in trunk's TestHBaseFsck (HBASE-5973) that I'm hunting down, so if you can give me the output of a run on 0.94 I'd be happy.

          If it ends up having a resource leak, I'd say that since hbck isn't long-running, it would probably be OK to fix that later, as long as we note it in a follow-on JIRA.

          Lars Hofhansl added a comment -

          @Jon: Are you saying sink rc1 for this?

          Jonathan Hsieh added a comment -

          @Lars Yeah, this is a feature regression. I'm unit testing the suggested fix right now, will test on a borked testing cluster I have.

          Lars Hofhansl added a comment -

          Dang

          Jonathan Hsieh added a comment - - edited

          Implemented fix suggested in conversation. Applied to 0.92.x based hbase, and confirmed that hbck's assignment operations worked.

          • Fixed a borked test cluster.
          • On a healthy cluster, used the hbase shell to close a region, ran the updated hbck to verify that the problem was detected, then ran 'hbck -fix' to fix the assignment; the problem was repaired.

          Note: for this to pass on trunk, HBASE-5793 is needed as well.

          Jonathan Hsieh added a comment -

          @Anoop – I checked the HConnection code and it seems like there should be no connection leaks – the suggested patch seems clean, and doesn't need follow up work.

          This is something I missed when I ported HBASE-5128 to the trunk branches.

          I'll commit this if a reviewer +1s it, or first thing Monday unless there are any concerns.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522691/hbase-5781.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.wal.TestHLog
          org.apache.hadoop.hbase.master.TestSplitLogManager

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1529//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1529//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1529//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          +1 on patch.

          Anoop Sam John added a comment -

          Yes, I think the finally block can be removed. [hbck is a short-lived process anyway.] Thanks, Jon, for the patch. Sorry I could not check it last night.
          @Lars We are eagerly waiting for good news from you about the 0.94 release.

          Jonathan Hsieh added a comment -

          Anoop, Kristam, thanks for finding this and hunting down the root cause of the problem. Thanks for the quick review Lars.

          Committed to 0.90/0.92/0.94. Trunk is broken currently because of HBASE-5747, HBASE-5793.

          Jonathan Hsieh added a comment -

          Oh – but it is committed to trunk as well.

          Hudson added a comment -

          Integrated in HBase-TRUNK #2763 (See https://builds.apache.org/job/HBase-TRUNK/2763/)
          HBASE-5781 Zookeeper session got closed while trying to assign the region to RS using hbck -fix (Revision 1326280)

          Result = FAILURE
          jmhsieh :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
          Hudson added a comment -

          Integrated in HBase-0.94 #117 (See https://builds.apache.org/job/HBase-0.94/117/)
          HBASE-5781 Zookeeper session got closed while trying to assign the region to RS using hbck -fix (Revision 1326281)

          Result = FAILURE
          jmhsieh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
          Hudson added a comment -

          Integrated in HBase-0.92 #371 (See https://builds.apache.org/job/HBase-0.92/371/)
          HBASE-5781 Zookeeper session got closed while trying to assign the region to RS using hbck -fix (Revision 1326282)

          Result = FAILURE
          jmhsieh :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #172 (See https://builds.apache.org/job/HBase-TRUNK-security/172/)
          HBASE-5781 Zookeeper session got closed while trying to assign the region to RS using hbck -fix (Revision 1326280)

          Result = FAILURE
          jmhsieh :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java
          Hudson added a comment -

          Integrated in HBase-0.92-security #105 (See https://builds.apache.org/job/HBase-0.92-security/105/)
          HBASE-5781 Zookeeper session got closed while trying to assign the region to RS using hbck -fix (Revision 1326282)

          Result = FAILURE
          jmhsieh :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java

            People

            • Assignee:
              Jonathan Hsieh
              Reporter:
              Kristam Subba Swathi
            • Votes:
              0
              Watchers:
              4