HBASE-3379: Log splitting slowed by repeated attempts at connecting to downed datanode

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 0.92.0
    • Component/s: wal
    • Labels: None

    Description

      In testing, if I kill the RS and DN on a node, log splitting takes longer because we doggedly retry connecting to the downed DN to get the WAL blocks. Here's the cycle I see:

      2010-12-21 17:34:48,239 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_900551257176291912_1203821 failed  because recovery from primary datanode 10.20.20.182:10010 failed 5 times.    Pipeline was 10.20.20.184:10010, 10.20.20.186:10010, 10.20.20.182:10010. Will retry...
      2010-12-21 17:34:50,240 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 0 time(s).
      2010-12-21 17:34:51,241 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 1 time(s).
      2010-12-21 17:34:52,241 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 2 time(s).
      2010-12-21 17:34:53,242 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 3 time(s).
      2010-12-21 17:34:54,243 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 4 time(s).
      2010-12-21 17:34:55,243 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 5 time(s).
      2010-12-21 17:34:56,244 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 6 time(s).
      2010-12-21 17:34:57,245 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 7 time(s).
      2010-12-21 17:34:58,245 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 8 time(s).
      2010-12-21 17:34:59,246 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 9 time(s).
      2010-12-21 17:34:59,246 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #5 from primary datanode 10.20.20.182:10010
      java.net.ConnectException: Call to /10.20.20.182:10020 failed on connection exception: java.net.ConnectException: Connection refused
          at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)
          at org.apache.hadoop.ipc.Client.call(Client.java:743)
          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
          at $Proxy8.getProtocolVersion(Unknown Source)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
      ...
      

      "because recovery from primary datanode" is done 5 times (hardcoded). Within these retries we'll do

      // From org.apache.hadoop.ipc.Client: how many times each connection attempt is retried
      this.maxRetries = conf.getInt("ipc.client.connect.max.retries", 10);
      

      We should get the hardcoded 5 attempts fixed, and we should document ipc.client.connect.max.retries as an important config and recommend bringing it down from its default. As the log above shows, a single recovery attempt costs ~11 seconds (10 connect retries at roughly one per second), so the 5 hardcoded attempts stall us for close to a minute per block before the DFSClient even starts the whole cycle over.
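
      For the client-side workaround, here's a minimal sketch of lowering the retry count programmatically (the same key can equally be set in hbase-site.xml); the value 3 is only an illustrative choice, not a tested recommendation:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;

      public class LowerIpcRetries {
        public static void main(String[] args) {
          // Pick up the usual HBase/Hadoop configuration (hbase-site.xml, etc.)
          Configuration conf = HBaseConfiguration.create();
          // Default is 10 (see org.apache.hadoop.ipc.Client above); a downed DN
          // refuses connections immediately, so fewer retries means less stalling.
          // 3 is an illustrative value, not a tested recommendation.
          conf.setInt("ipc.client.connect.max.retries", 3);
          System.out.println("ipc.client.connect.max.retries = "
              + conf.getInt("ipc.client.connect.max.retries", 10));
        }
      }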


    People

      Assignee: stack (Michael Stack)
      Reporter: stack (Michael Stack)
      Votes: 0
      Watchers: 2
