HBase
  1. HBase
  2. HBASE-6240

Race in HCM.getMaster stalls clients

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.94.0
    • Fix Version/s: 0.94.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I found this issue trying to run YCSB on 0.94, I don't think it exists on any other branch. I believe that this was introduced in HBASE-5058 "Allow HBaseAdmin to use an existing connection".

      The issue is that in HCM.getMaster it does this recipe:

      1. Check if the master is null and runs (if so, return)
      2. Grab a lock on masterLock
      3. nullify this.master
      4. try to get a new master

      The issue happens at 3, it should re-run 1 since while you're waiting on the lock someone else could have already fixed it for you. What happens right now is that the threads are all able to set the master to null before others are able to get out of getMaster and it's a complete mess.

      Figuring it out took me some time because it doesn't manifest itself right away, silent retries are done in the background. Basically the first clue was this:

      Error doing get: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:
      Tue Jun 19 23:40:46 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:40:47 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:40:48 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:40:49 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:40:51 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:40:53 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:40:57 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:41:01 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:41:09 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      Tue Jun 19 23:41:25 UTC 2012, org.apache.hadoop.hbase.client.HTable$3@571a4bd4, java.io.IOException: org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2eb0a3f5 closed
      

      This was caused by the little dance up in HBaseAdmin where it deletes "stale" connections... which are not stale at all.

      1. HBASE-6240_1_0.94.patch
        0.8 kB
        ramkrishna.s.vasudevan
      2. HBASE-6240.patch
        0.6 kB
        Jean-Daniel Cryans

        Activity

        Hide
        Jean-Daniel Cryans added a comment -

        Patch that adds the recheck inside the synchronized block, it fixes the issue in my test but I don't know if it breaks anything else at the moment. Only for 0.94

        Show
        Jean-Daniel Cryans added a comment - Patch that adds the recheck inside the synchronized block, it fixes the issue in my test but I don't know if it breaks anything else at the moment. Only for 0.94
        Hide
        Ted Yu added a comment -

        Nice finding.

        +1 on patch.

        Show
        Ted Yu added a comment - Nice finding. +1 on patch.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Pls check the latest patch. With JD's patch testcases related to Master restart's were failing, like TestMasterRestartAfterDisablingTable, TestSplitTransactionOnCluster. I think we need to handle UndeclaredThrowableException before getting the new master. Becuase the master.isRunning will throw an exception becuase the master has already switched. May be the problem is more prominent in our testcase framework. Pls review and provide your comments. If it is ok, I can commit this today.

        Show
        ramkrishna.s.vasudevan added a comment - Pls check the latest patch. With JD's patch testcases related to Master restart's were failing, like TestMasterRestartAfterDisablingTable, TestSplitTransactionOnCluster. I think we need to handle UndeclaredThrowableException before getting the new master. Becuase the master.isRunning will throw an exception becuase the master has already switched. May be the problem is more prominent in our testcase framework. Pls review and provide your comments. If it is ok, I can commit this today.
        Hide
        Jean-Daniel Cryans added a comment -

        Ah yeah I completely overlooked that. Did you see how we get this exception? It looks so dirty in the code and now having it twice would look a lot worse.

        Show
        Jean-Daniel Cryans added a comment - Ah yeah I completely overlooked that. Did you see how we get this exception? It looks so dirty in the code and now having it twice would look a lot worse.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Did you see how we get this exception?

        When we try to do master.isRunning(), the call goes to the master itself thro the RPC layer. So as the master is down we get this exception. Did you mean something else?
        Sorry, if my answer is not clear.

        Show
        ramkrishna.s.vasudevan added a comment - Did you see how we get this exception? When we try to do master.isRunning(), the call goes to the master itself thro the RPC layer. So as the master is down we get this exception. Did you mean something else? Sorry, if my answer is not clear.
        Hide
        Jean-Daniel Cryans added a comment -

        Did you mean something else?

        You answered right. It seems to me that if we get an exception, then either the master is down or it is unreachable. Unfortunately we don't have a concept for the latter.

        Looking at the code we do have a isMasterRunning() that encapsulates some of logic about looking up a master but strangely enough you'll never get false getting out of there, it's true or MasterNotRunningException!

        I think we should refactor this in a followup jira. In the meantime +1 on your patch Ram.

        Show
        Jean-Daniel Cryans added a comment - Did you mean something else? You answered right. It seems to me that if we get an exception, then either the master is down or it is unreachable. Unfortunately we don't have a concept for the latter. Looking at the code we do have a isMasterRunning() that encapsulates some of logic about looking up a master but strangely enough you'll never get false getting out of there, it's true or MasterNotRunningException! I think we should refactor this in a followup jira. In the meantime +1 on your patch Ram.
        Hide
        ramkrishna.s.vasudevan added a comment -

        @JD
        +1 on opening a follow up JIRA. I can commit this today. Thanks.

        Show
        ramkrishna.s.vasudevan added a comment - @JD +1 on opening a follow up JIRA. I can commit this today. Thanks.
        Hide
        ramkrishna.s.vasudevan added a comment -

        Committed to 0.94.
        Thanks JD for the patch and review.
        Thanks to Ted for the review.
        Will open a follow up JIRA to address JD's comments over here.

        Show
        ramkrishna.s.vasudevan added a comment - Committed to 0.94. Thanks JD for the patch and review. Thanks to Ted for the review. Will open a follow up JIRA to address JD's comments over here.
        Hide
        ramkrishna.s.vasudevan added a comment -

        HBASE-6273 raised.

        Show
        ramkrishna.s.vasudevan added a comment - HBASE-6273 raised.
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94 #282 (See https://builds.apache.org/job/HBase-0.94/282/)
        HBASE-6240 Race in HCM.getMaster stalls clients

        Submitted by:J-D, Ram
        Reviewed by:J-D, Ted (Revision 1354116)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
        Show
        Hudson added a comment - Integrated in HBase-0.94 #282 (See https://builds.apache.org/job/HBase-0.94/282/ ) HBASE-6240 Race in HCM.getMaster stalls clients Submitted by:J-D, Ram Reviewed by:J-D, Ted (Revision 1354116) Result = FAILURE ramkrishna : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.94-security #38 (See https://builds.apache.org/job/HBase-0.94-security/38/)
        HBASE-6240 Race in HCM.getMaster stalls clients

        Submitted by:J-D, Ram
        Reviewed by:J-D, Ted (Revision 1354116)

        Result = FAILURE
        ramkrishna :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
        Show
        Hudson added a comment - Integrated in HBase-0.94-security #38 (See https://builds.apache.org/job/HBase-0.94-security/38/ ) HBASE-6240 Race in HCM.getMaster stalls clients Submitted by:J-D, Ram Reviewed by:J-D, Ted (Revision 1354116) Result = FAILURE ramkrishna : Files : /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java

          People

          • Assignee:
            ramkrishna.s.vasudevan
            Reporter:
            Jean-Daniel Cryans
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development