HBASE-8723

HBase Integration tests are failing because of new defaults.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.95.0
    • Fix Version/s: 0.98.0, 0.95.2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Changed the default number of RPC retries to 30 to ensure the client doesn't give up too soon during a region failover.

      Description

      Currently, any IT tests that run the chaos monkey fail because regions are not recovered before the number of RPC retries is exhausted.

      We should set that default higher.
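      As a concrete illustration of the change (the patch touches HConstants.java and hbase-default.xml), a site can also override the retry count itself; this is a sketch assuming the standard hbase.client.retries.number property, in the same hbase-site.xml style as the config below:

      <?xml version="1.0"?>
      <configuration>
        <property>
          <!-- Client-side RPC retry count; this issue raises the shipped default from 10 to 30. -->
          <name>hbase.client.retries.number</name>
          <value>30</value>
        </property>
      </configuration>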

      1. HBASE-8723-1.patch
        2 kB
        Elliott Clark
      2. HBASE-8723-0.patch
        2 kB
        Elliott Clark

        Activity

        Elliott Clark added a comment -

        My hdfs-site has lots of settings that should make failover faster, but the time from kill to region open is still long enough that we fail IT tests:

        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
        
          <property>
            <name>dfs.datanode.handler.count</name>
            <!-- default 10 -->
            <value>32</value>
            <description>The number of server threads for the
            datanode.</description>
          </property>
        
          <property>
            <name>dfs.namenode.handler.count</name>
            <!-- default 10 -->
            <value>32</value>
            <description>The number of server threads for the
            namenode.</description>
          </property>
        
          <property>
            <name>dfs.block.size</name>
            <value>134217728</value>
            <description>The default block size for new files.</description>
          </property>
        
          <property>
            <name>dfs.datanode.max.xcievers</name>
            <value>4098</value>
          </property>
        
          <property>
            <name>dfs.namenode.replication.interval</name>
            <value>15</value>
          </property>
        
          <property>
            <name>dfs.balance.bandwidthPerSec</name>
            <value>10485760</value>
          </property>
        
          <property>
            <name>fs.checkpoint.dir</name>
            <value>${hadoop.data.dir1}/dfs/namesecondary</value>
          </property>
        
          <property>
            <name>dfs.name.dir</name>
            <value>${hadoop.data.dir0}/dfs/name</value>
          </property>
        
          <property>
            <name>dfs.data.dir</name>
            <value>${hadoop.data.dir0}/dfs/data,${hadoop.data.dir1}/dfs/data,${hadoop.data.dir2}/dfs/data,${hadoop.data.dir3}/dfs/data,${hadoop.data.dir4}/dfs/data,${hadoop.data.dir5}/dfs/data,${hadoop.data.dir6}/dfs/data</value>
          </property>
        
          <property>
            <name>dfs.datanode.socket.write.timeout</name>
            <value>10000</value>
          </property>
        
          <property>
            <name>ipc.client.connect.timeout</name>
            <value>1000</value>
          </property>
        
          <property>
            <name>ipc.client.connect.max.retries.on.timeouts</name>
            <value>2</value>
          </property>
        
          <property>
            <name>dfs.socket.timeout</name>
            <value>5000</value>
          </property>
        
          <property>
            <name>dfs.socket.write.timeout</name>
            <value>5000</value>
          </property>
        
          <property>
            <name>dfs.domain.socket.path</name>
            <value>/var/lib/hadoop/dn_socket._PORT</value>
          </property>
        
          <property>
            <name>dfs.block.local-path-access.user</name>
            <value>hbase</value>
          </property>
        
          <property>
             <name>dfs.client.read.shortcircuit.skip.checksum</name>
             <value>true</value>
           </property>
        
          <property>
            <name>dfs.client.file-block-storage-locations.timeout</name>
            <value>3000</value>
          </property>
        
        </configuration>
        
        Elliott Clark added a comment -

        Here's what I tried, and it makes the tests go green.

        stack added a comment -

        +1

        We can work on speeding up MTTR elsewhere.

        Change 'Default: 10.' to 30 on commit.

        Sergey Shelukhin added a comment -

        How environment-specific is this? I'm +1 assuming it's for a typical cluster.

        Elliott Clark added a comment -

        Pretty typical clusters, unfortunately. I've had this happen consistently on a 10-node VM cluster and a 7-node hardware cluster (6 spindles each).

        Elliott Clark added a comment -

        Patch that I'll commit.

        • Fixed the typo that stack found.
        Elliott Clark added a comment -

        Committed to trunk and 0.95

        Thanks for the reviews.

        Sergey Shelukhin added a comment -

        Sorry, what I meant is a cluster with typical failures. Are such lengthy retries only good for the chaos monkey, or would they actually make sense in some production environments?

        Elliott Clark added a comment -

        I've seen the tests fail when just the RS holding meta failed, so I would say the failures weren't too unreasonable.

        We're seeing failure to open taking ~50 seconds, which is outside what the defaults allowed (34-ish seconds of retries before).
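        The retry budget above is roughly the base client pause summed over the backoff table. A minimal sketch of that arithmetic, assuming the 0.95-era HConstants.RETRY_BACKOFF multipliers and the default 100 ms hbase.client.pause (both are assumptions; check your build):

        ```python
        # Sketch of the client retry budget; the backoff table and 100 ms base
        # pause are assumptions taken from the 0.95-era HConstants.
        RETRY_BACKOFF = [1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200]

        def total_pause_ms(retries, pause_ms=100):
            """Total sleep across `retries` attempts; multipliers clamp at the last table entry."""
            total = 0
            for i in range(retries):
                total += pause_ms * RETRY_BACKOFF[min(i, len(RETRY_BACKOFF) - 1)]
            return total

        print(total_pause_ms(10))  # old default: 38100 ms, ~38 s of retries
        print(total_pause_ms(30))  # new default: 428100 ms, ~7 min of retries
        ```

        Under these assumptions the old 10-retry budget (~38 s) falls short of the ~50 s region-open times reported above, which lines up with the observed failures; 30 retries leaves ample slack.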

        Hudson added a comment -

        Integrated in HBase-TRUNK #4171 (See https://builds.apache.org/job/HBase-TRUNK/4171/)
        HBASE-8723 HBase Intgration tests are failing because of new defaults. (Revision 1491640)

        Result = SUCCESS
        eclark :
        Files :

        • /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
        • /hbase/trunk/hbase-common/src/main/resources/hbase-default.xml
        Hudson added a comment -

        Integrated in hbase-0.95 #235 (See https://builds.apache.org/job/hbase-0.95/235/)
        HBASE-8723 HBase Intgration tests are failing because of new defaults. (Revision 1491645)

        Result = FAILURE
        eclark :
        Files :

        • /hbase/branches/0.95/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
        • /hbase/branches/0.95/hbase-common/src/main/resources/hbase-default.xml

          People

          • Assignee:
            Elliott Clark
            Reporter:
            Elliott Clark
          • Votes:
            0
            Watchers:
            4

            Dates

            • Created:
              Updated:
              Resolved:
