Details
Description
I was playing with master branch in distributed mode (3 rs + master + backup_master) and notice strange behavior when i was testing this sequence of events on single rs: /kill/start/run_balancer while client was writing data to cluster (LoadTestTool).
I have notice that LTT fails with following:
2015-09-09 11:05:58,364 INFO [main] client.AsyncProcess: #2, waiting for some tasks to finish. Expected max=0, tasksInProgress=35 Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: BindException: 1 time, at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:228) at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:208) at org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1697) at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:211)
After some digging and adding some more logging in code i have notice that following condition in
AsyncRpcClient.removeConnection(AsyncRpcChannel connection)
is never true:
if (connectionInPool == connection) {
causing that
AsyncRpcChannel
connection is never removed from
connections
pool in case rs fails.
After changing this condition to:
if (connectionInPool.address.equals(connection.address)) {
issue was resolved and client was removing failed server from connections pool.
I will attach patch after running some more tests.