HBase / HBASE-7293

[replication] Remove dead sinks from ReplicationSource.currentPeers and pick new ones


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.94.3, 0.95.2
    • Fix Version/s: 0.94.5, 0.95.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      I happened to look at a log today where I saw a lot of lines like this:

      2012-12-06 23:29:08,318 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: This server is in the failed servers list: sv4r20s49/10.4.20.49:10304
      2012-12-06 23:29:15,987 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't replicate because of a local or network error: 
      java.net.ConnectException: Connection refused
      	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
      	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
      	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:519)
      	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:484)
      	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:416)
      	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:462)
      	at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1150)
      	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1000)
      	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
      	at $Proxy14.replicateLogEntries(Unknown Source)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:627)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
      2012-12-06 23:29:15,988 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: Connection refused
      

      What struck me as weird is that this had been going on for some days; I would expect the RS to find new servers if it wasn't able to replicate. But in reality only a few of the chosen sink RSs were down, so the source eventually hits one that's good and never gets to refresh its list of servers.

      We should remove the dead servers from the list: retrying them is spammy in the logs and probably adds some replication lag on the slave.
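
To make the proposal concrete, here is a minimal sketch of the idea: drop a sink from the current list as soon as a connection to it fails, and pick a fresh set once the list runs empty. This is not the ReplicationSource code or the attached patch. The class SinkTracker, its methods, and the use of plain strings for server names are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

/** Hypothetical sketch; names and structure are assumptions, not HBase APIs. */
public class SinkTracker {
    // Stands in for "ask the slave cluster which region servers are alive".
    private final Supplier<List<String>> slaveServers;
    // How many sinks to keep in the current list.
    private final int wanted;
    private final Random random;
    private final List<String> currentPeers = new ArrayList<>();

    public SinkTracker(Supplier<List<String>> slaveServers, int wanted, Random random) {
        this.slaveServers = slaveServers;
        this.wanted = wanted;
        this.random = random;
        chooseSinks();
    }

    /** Replace the current list with a fresh random subset of the live servers. */
    public void chooseSinks() {
        List<String> candidates = new ArrayList<>(slaveServers.get());
        Collections.shuffle(candidates, random);
        currentPeers.clear();
        currentPeers.addAll(candidates.subList(0, Math.min(wanted, candidates.size())));
    }

    /** Called after a failed connection: forget the sink, re-pick if none are left. */
    public void reportDeadSink(String sink) {
        currentPeers.remove(sink);
        if (currentPeers.isEmpty()) {
            chooseSinks();
        }
    }

    /** Pick a random sink from the current list for the next batch of edits. */
    public String pickSink() {
        if (currentPeers.isEmpty()) {
            chooseSinks();
        }
        if (currentPeers.isEmpty()) {
            throw new IllegalStateException("no sinks available");
        }
        return currentPeers.get(random.nextInt(currentPeers.size()));
    }

    /** Defensive copy so callers cannot mutate the internal list. */
    public List<String> getCurrentPeers() {
        return new ArrayList<>(currentPeers);
    }
}
```

In this sketch a caller would invoke reportDeadSink from the place where the "Connection refused" exception above is caught, so a dead server is retried at most once per selection round instead of indefinitely.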

      Attachments

        1. 7293-0.94.txt
          3 kB
          Lars Hofhansl
        2. 7293-0.94-v2.txt
          1 kB
          Lars Hofhansl
        3. 7293-0.96.txt
          2 kB
          Lars Hofhansl

          People

            Assignee: larsh Lars Hofhansl
            Reporter: jdcryans Jean-Daniel Cryans
            Votes: 0
            Watchers: 6
