HBase / HBASE-9139

Independent timeout configuration for rpc channel between cluster nodes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.94.10, 0.96.0
    • Fix Version/s: 0.95.2, 0.94.11, 0.96.0
    • Component/s: IPC/RPC, regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The default for "hbase.rpc.timeout" is 60000 ms (1 min). Users sometimes
      increase it to a larger value, such as 600000 ms (10 min), for
      applications that run many concurrent loads from the client. Some users
      share the same hbase-site.xml between client and server.
      HRegionServer#tryRegionServerReport reports to the live master over the
      rpc channel, and this opens a window during master failover: the region
      server tries to connect to a master that was just killed; the backup
      master takes the active role immediately and registers itself in
      /hbase/master, but the region server is still waiting for the rpc
      timeout on its connection to the dead master. If "hbase.rpc.timeout" is
      too long, master failover takes correspondingly long.

      Could we therefore split this into two options: "hbase.rpc.timeout"
      stays for the hbase client, while "hbase.rpc.internal.timeout" covers
      the regionserver/master rpc channel and can be set to a shorter value
      without affecting the real client rpc timeout?

      1. 9139-0.94-v0.patch
        1 kB
        Julian Zhou
      2. 9139-trunk-v0.patch
        2 kB
        Julian Zhou
      3. 9139-trunk-v1.patch
        3 kB
        Julian Zhou
      4. 9139-0.94-v1.patch
        3 kB
        Julian Zhou
      5. 9139-trunk-v1.patch
        3 kB
        Nicolas Liochon

        Activity

        Nicolas Liochon added a comment -

        Even a 1 minute timeout is not ideal in this case: we know that the work to do server side is trivial, and we know it's idempotent, so we can retry. So I would tend to use a specific setting for such operations. It would be case by case. I don't have a good name for this setting, maybe something like hbase.rpc.short.operation.timeout

        Nick Dimiduk added a comment -

        Does it make more sense from a user perspective to have a global system setting called hbase.rpc.timeout and then allow for it to be overridden at the client level with hbase.client.rpc.timeout? This allows us to introduce alternative configurations for other timeout properties within the "internal" category, if that becomes necessary.

        (It also keeps the name consistent with other client-oriented configuration parameters.)

        Nicolas Liochon added a comment -

        I think it could become difficult to manage: then we would have half-internal clients like MapReduce and so on.
        So to me, if you want different settings, just use different configuration files.

        The real difference is in the operations, imho. Some operations take no time and can be retried. So having the same timeout for such operations and for operations that can take longer (for example because they write to HDFS) is not good. Internally, in HBase code, we should use a different setting for such operations.
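
The idea of a short timeout plus retries for cheap, idempotent operations can be sketched in plain Java. This is only an illustration under stated assumptions, not HBase code: the class `ShortOperationRetrySketch`, the helper `callWithRetries`, and the timeout and attempt values are all hypothetical, and the "dead peer" is simulated with a sleep.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: retry a cheap, idempotent call under a short timeout. */
public class ShortOperationRetrySketch {

    /**
     * Runs op up to maxAttempts times, waiting at most timeoutMs per attempt.
     * Only safe for idempotent operations, since a timed-out attempt may
     * still have taken effect on the remote side.
     */
    static <T> T callWithRetries(Callable<T> op, long timeoutMs, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> pending = pool.submit(op);
                try {
                    return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    pending.cancel(true); // give up on this attempt and retry
                    last = e;
                }
            }
            throw last; // every attempt timed out or failed
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        String result = callWithRetries(() -> {
            if (calls.incrementAndGet() == 1) {
                Thread.sleep(5_000); // simulated unresponsive peer on the first try
            }
            return "report accepted";
        }, 200, 3);
        // prints "report accepted after 2 attempts"
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

With a long per-attempt timeout, the first attempt alone would block for the whole timeout; with a short one, the caller fails fast and the retry can reach whichever peer is alive by then.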

        Lars Hofhansl added a comment -

        I agree with Nicolas. The operation that Julian mentions definitely falls into that category.

        Nicolas Liochon added a comment -

        Julian Zhou, do you agree? Do you want to submit a patch?

        Julian Zhou added a comment -

        Hi Nicolas Liochon, agreed. I just attached the v0 patch for 0.94. Currently it seems that only the regionserver's report to the master needs this short rpc timeout setting; HConnectionManager, used for client calls, is unchanged. Could you help review? If the trunk logic is the same, I will attach a trunk version afterwards. Thanks~

        Julian Zhou added a comment -

        For now I have named it "hbase.rpc.shortoperation.timeout", with a default of 10 s.
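
How the two keys would resolve when client and server share one hbase-site.xml can be sketched as follows. This is a minimal stand-in, not the patch itself: a plain `Map` replaces HBase's configuration object, and the class, constant, and method names are hypothetical. Only the key strings and the defaults (60000 ms and 10 s) come from this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the key lookup; a Map stands in for the real config. */
public class RpcTimeoutSketch {
    // Key names and defaults as quoted in this issue.
    static final String RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
    static final int DEFAULT_RPC_TIMEOUT_MS = 60000;
    static final String SHORT_OPERATION_TIMEOUT_KEY = "hbase.rpc.shortoperation.timeout";
    static final int DEFAULT_SHORT_OPERATION_TIMEOUT_MS = 10000;

    /** Returns the integer stored under key, or defaultValue if the key is unset. */
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    /** Timeout a client connection would use. */
    static int clientTimeoutMs(Map<String, String> conf) {
        return getInt(conf, RPC_TIMEOUT_KEY, DEFAULT_RPC_TIMEOUT_MS);
    }

    /** Timeout the region server would use when reporting to the master. */
    static int reportTimeoutMs(Map<String, String> conf) {
        return getInt(conf, SHORT_OPERATION_TIMEOUT_KEY, DEFAULT_SHORT_OPERATION_TIMEOUT_MS);
    }

    public static void main(String[] args) {
        // Shared settings where the user raised only the client timeout.
        Map<String, String> shared = new HashMap<>();
        shared.put(RPC_TIMEOUT_KEY, "600000");
        System.out.println("client: " + clientTimeoutMs(shared) + " ms"); // client: 600000 ms
        System.out.println("report: " + reportTimeoutMs(shared) + " ms"); // report: 10000 ms
    }
}
```

Because the report path reads its own key, raising "hbase.rpc.timeout" to 10 minutes for loading jobs no longer stretches the wait on a dead master.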

        Nicolas Liochon added a comment -

        Julian Zhou, I'm OK with the patch. It's simpler than I thought. The existing retry logic seems OK. I'm also OK with the naming and the default value.

        Yes, we need a patch for trunk/0.95 before applying it to 0.94 (it should be very similar). For 0.94, it's better to get a go-ahead from Lars Hofhansl.

        Julian Zhou added a comment -

        Hi Nicolas Liochon, I attached trunk patch v0. I searched for all references to HBASE_RPC_TIMEOUT_KEY and "hbase.rpc.timeout". Apart from test/ code, only HCM and the regionserver code initialize the rpc timeout value. HCM is still based on "hbase.rpc.timeout", so it seems we only need to apply the new conf to the regionserver's rpc timeout, and the change is straightforward and simple. Could you help review? Thanks Nicolas Liochon and Lars Hofhansl.

        Nicolas Liochon added a comment -

        I agree. It's great that it's so simple in the end. Just one thing I forgot in my previous review: could you add this setting to hbase-common/main/resources/hbase-default.xml (next to the existing hbase.rpc.timeout), with a nice explanation of what it does?

        Thanks a lot, Julian.

        Julian Zhou added a comment -

        Hi Nicolas Liochon, I attached v1 of the trunk and 0.94 patches, with a description in hbase-default.xml. Thanks for reviewing the wording; I will apply any comments.
        For hbase-default.xml in 0.94, there is no description for "hbase.rpc.timeout", so I copied both "hbase.rpc.timeout" and "hbase.rpc.shortoperation.timeout" from trunk. Lars Hofhansl, is it OK to have them?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12597108/9139-0.94-v1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/6703//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12597484/9139-trunk-v1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        -1 javadoc. The javadoc tool appears to have generated 3 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        -1 core tests. The patch failed these unit tests:

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/6704//console

        This message is automatically generated.

        stack added a comment -

        +1 Looks like this rpc is only for reporting master status

        The failed unit test is TestAdmin; it is missing in list of tests. I'm guessing it is the testTableExists test that is failing.

        Lars Hofhansl added a comment -

        +1 (0.94 and later)

        Lars Hofhansl added a comment -

        Ran TestAdmin with patch applied in 0.94. Passes fine. So this is good to commit as far as I am concerned.

        Lars Hofhansl added a comment -

        Going to commit later today, unless I hear objections.

        Lars Hofhansl added a comment -

        Committed to 0.94, 0.95, and trunk.

        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.94-security #253 (See https://builds.apache.org/job/HBase-0.94-security/253/)
        HBASE-9139 Independent timeout configuration for rpc channel between cluster nodes (Julian Zhou) (larsh: rev 1513341)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/branches/0.94/src/main/resources/hbase-default.xml
        Hudson added a comment -

        SUCCESS: Integrated in hbase-0.95-on-hadoop2 #238 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/238/)
        HBASE-9139 Independent timeout configuration for rpc channel between cluster nodes (Julian Zhou) (larsh: rev 1513338)

        • /hbase/branches/0.95/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
        • /hbase/branches/0.95/hbase-common/src/main/resources/hbase-default.xml
        • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-0.94 #1105 (See https://builds.apache.org/job/HBase-0.94/1105/)
        HBASE-9139 Independent timeout configuration for rpc channel between cluster nodes (Julian Zhou) (larsh: rev 1513341)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/branches/0.94/src/main/resources/hbase-default.xml
        Hudson added a comment -

        FAILURE: Integrated in HBase-TRUNK #4378 (See https://builds.apache.org/job/HBase-TRUNK/4378/)
        HBASE-9139 Independent timeout configuration for rpc channel between cluster nodes (Julian Zhou) (larsh: rev 1513337)

        • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
        • /hbase/trunk/hbase-common/src/main/resources/hbase-default.xml
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #671 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/671/)
        HBASE-9139 Independent timeout configuration for rpc channel between cluster nodes (Julian Zhou) (larsh: rev 1513337)

        • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
        • /hbase/trunk/hbase-common/src/main/resources/hbase-default.xml
        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        Hudson added a comment -

        SUCCESS: Integrated in hbase-0.95 #438 (See https://builds.apache.org/job/hbase-0.95/438/)
        HBASE-9139 Independent timeout configuration for rpc channel between cluster nodes (Julian Zhou) (larsh: rev 1513338)

        • /hbase/branches/0.95/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
        • /hbase/branches/0.95/hbase-common/src/main/resources/hbase-default.xml
        • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java

          People

          • Assignee: Julian Zhou
          • Reporter: Julian Zhou
          • Votes: 0
          • Watchers: 8
