HBASE-1815

HBaseClient can get stuck in an infinite loop while attempting to contact a failed regionserver

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.20.1
    • Component/s: Client
    • Labels:
      None
    • Environment:

      Ubuntu Linux (Linux <elided> 2.6.24-23-generic #1 SMP Wed Apr 1 21:43:24 UTC 2009 x86_64 GNU/Linux), java version "1.6.0_06", Java(TM) SE Runtime Environment (build 1.6.0_06-b02), Java HotSpot(TM) 64-Bit Server VM (build 10.0-b22, mixed mode)

    • Hadoop Flags:
      Reviewed

      Description

      While using the HBase Thrift server, if a regionserver goes down due to shutdown or failure, clients will time out because the Thrift server cannot contact the dead regionserver.

      Attachments

      1. thrift_server_threaddump
        449 kB
        Justin Lynn
      2. thrift_server_log_excerpt
        184 kB
        Justin Lynn
      3. thrift_server_threaddump_1
        672 kB
        Justin Lynn
      4. ipctimeout.patch
        3 kB
        stack
      5. hbaseclient-v3.patch
        9 kB
        stack

        Activity

        stack added a comment -

        Committed branch and trunk.

        stack added a comment -

        Yes, I should have said so. I killed the master and watched what the regionservers did. I also killed the cluster and watched the client. It all seems to run more regularly now, with less weird retrying.

        Thanks for review.

        Jean-Daniel Cryans added a comment -

        +1 patch seems good. Apart from unit tests and loading, did you try killing some region servers?

        stack added a comment -

        All unit tests pass.

        stack added a comment -

        I had this patch installed during my overnight loading. The upload worked about the same as usual, so this patch doesn't seem to change the basic workings.

        stack added a comment -

        This version adds cleanup.

        In the HRegionServer main run loop, wait before retrying rather than running all retries without pause.

        Changed the HBaseRPC RetriesExhaustedException so it's about failure to get a proxy instead of a wonky message about an unknown row.

        Moved the get of a regionserver connection into the try/catch so that if it fails, it's retried.

        This patch changes how our retrying from the client and from the servers works. I tested it on a cluster and it seems more regular and 'live' now than previously, but I may have missed cases where we used to rely on the rpc retry. I'm not sure how to find those other than to commit and wait till someone complains.

        Review appreciated.
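
        A minimal sketch (not the actual HRegionServer code) of the "wait before retrying" change described above, with illustrative names and values:

        public class RetryWithPause {
          public static void main(String[] args) throws InterruptedException {
            int maxRetries = 10;          // e.g. hbase.client.retries.number
            long pauseMillis = 2000L;     // e.g. hbase.client.pause
            for (int attempt = 0; attempt < maxRetries; attempt++) {
              if (tryConnect()) {
                return;                   // succeeded, stop retrying
              }
              Thread.sleep(pauseMillis);  // pause before the next attempt instead of retrying immediately
            }
            throw new RuntimeException("retries exhausted");
          }

          // Stand-in for the real connection attempt; always fails in this sketch.
          private static boolean tryConnect() {
            return false;
          }
        }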

        stack added a comment -

        This patch might be a bit radical, but here goes.

        The high-level motivation is to undo the retrying and sleeps down in ipc and let retrying be done at a higher level, up in the hbase client.

        In ipc, socket setup had a timeout of 20 seconds. Ipc then retries the socket setup ten times with a 1 second sleep in between. That's 210 seconds or so before we time out down in the guts of RPC. We then go up to the retry logic in hbase (usually, not always) and do ten retries with a 2 second pause in between (if we get a SocketTimeoutException setting up the connection, we'd retry a hard-coded 45 times, i.e. 15 minutes).

        In Justin's case, I don't think we were hitting SocketTimeoutException, going by the stack trace. It was more the 210 seconds per thread, but my guess is that his thrift client had probably timed out already.

        This patch turns off retry down in the ipc client (let the upper layers do the retrying), changes hard-coded sleep times to be the hbase.client.pause time (2 seconds), and removes the hard-coded 45. It also adds an hbase prefix to the ipc configuration parameters in case we want different values from hadoop.

        Let me try out this patch. My guess is that there are places in hbase where we don't retry because we were dependent on ipc doing retry for us. Let me find those.
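
        For illustration, a minimal sketch of how the retry knobs mentioned above would be read from configuration. hbase.client.pause and hbase.client.retries.number are existing HBase settings; the hbase-prefixed ipc key is an assumption based on the "adds an hbase prefix" change, and the exact names are in the patch:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        public class RetryConfigSketch {
          public static void main(String[] args) {
            Configuration conf = new HBaseConfiguration();
            long pause = conf.getLong("hbase.client.pause", 2000);         // 2 second pause between retries
            int retries = conf.getInt("hbase.client.retries.number", 10);  // client-level retry count
            // Assumed key name: "ipc.client.connect.max.retries" with the hbase prefix added.
            int connectRetries = conf.getInt("hbase.ipc.client.connect.max.retries", 0);
            System.out.println(pause + " " + retries + " " + connectRetries);
          }
        }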

        stack added a comment -

        HBaseClient also has this issue; from the list:

        Yeah, this is down in the guts of the hadoop rpc we use. Around connection setup it has its own config that is not well aligned with ours (ours being the retries and pause settings).

        The maxRetries down in ipc is:

        this.maxRetries = conf.getInt("ipc.client.connect.max.retries", 10);

        That's for an IOE other than timeout. For timeout, it does this:

        } catch (SocketTimeoutException toe) {
          /* The max number of retries is 45,
           * which amounts to 20s*45 = 15 minutes retries.
           */
          handleConnectionFailure(timeoutFailures++, 45, toe);

        Let me file an issue to address the above. The retries should be our retries... and in here it has a hardcoded 1000ms that instead should be our pause.... Not hard to fix.
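
        A rough, paraphrased sketch of the shape of that ipc connection-failure handling (not the Hadoop source): give up once retries are exhausted, otherwise sleep the hard-coded 1000ms before the caller tries again. The fix is to drive both values from HBase's own retries and pause settings instead:

        import java.io.IOException;

        public class ConnectionFailureSketch {
          void handleConnectionFailure(int curRetries, int maxRetries, IOException ioe)
              throws IOException {
            if (curRetries >= maxRetries) {
              throw ioe;            // retries exhausted, surface the failure
            }
            try {
              Thread.sleep(1000);   // the hard-coded 1000ms pause the comment calls out
            } catch (InterruptedException ignored) {
              Thread.currentThread().interrupt();
            }
          }
        }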

        Andrew Purtell added a comment -

        Clients can monitor RS liveness via ZK and respond quickly via watches?
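
        A minimal sketch of that approach, assuming regionservers register ephemeral znodes under a path like /hbase/rs (the path and class names here are illustrative): watch the parent znode and react as soon as a server's node disappears, instead of waiting out connection timeouts.

        import java.util.List;
        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooKeeper;

        public class RegionServerLivenessWatcher implements Watcher {
          private final ZooKeeper zk;

          public RegionServerLivenessWatcher(String quorum) throws Exception {
            this.zk = new ZooKeeper(quorum, 30000, this);
          }

          public void watchRegionServers() throws Exception {
            // getChildren re-registers the watch; it fires when the child list changes.
            List<String> servers = zk.getChildren("/hbase/rs", this);
            System.out.println("live regionservers: " + servers);
          }

          public void process(WatchedEvent event) {
            try {
              watchRegionServers();   // a regionserver came or went: refresh and re-watch
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        }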

        stack added a comment -

        Working w/ JSharp and looking in the thread dumps, it looks like each thread has to do ten retries, sleeping a second between each retry. When there are many threads, we get a lot of messages in the log about the failure to connect. We need to recognize a dead remote side and handle it promptly.

        Justin Lynn added a comment -

        Another thread dump.

        Justin Lynn added a comment -

        These are the thrift server threaddumps and log files from the time when the failure was noticed.


          People

          • Assignee:
            stack
          • Reporter:
            Justin Lynn
          • Votes:
            0
          • Watchers:
            1
