This patch might be a bit radical, but here it goes.
The high-level motivation is to undo the retrying and sleeping down in ipc and let retrying be done at a higher level, up in the hbase client.
In ipc, socket setup has a timeout of 20 seconds. Ipc then retries the socket setup ten times with a 1 second sleep in between. That's 210 seconds or so before we time out down in the guts of RPC. We then go up to the retry logic in hbase (usually, not always) and do ten retries with a 2 second pause in between. (If we got a SocketTimeoutException setting up the connection, we'd retry a hard-coded 45 times, i.e. 15 minutes.)
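To make the arithmetic above concrete, here is a rough sketch of the worst-case waits. The class and method names are made up for illustration, and the model (every attempt runs to its full timeout and is followed by a sleep) is a simplification, not what the ipc code literally does. The stacked figure assumes each hbase-level retry pays the full ipc cost again.

```java
// Hypothetical helper, not part of hbase or hadoop: back-of-envelope
// worst-case wait times for the retry settings described above.
public final class RetryBudget {
    private RetryBudget() {}

    /** Seconds spent if every attempt hits its timeout and is followed by a sleep. */
    public static long worstCaseSeconds(long attempts, long timeoutSeconds, long sleepSeconds) {
        return attempts * (timeoutSeconds + sleepSeconds);
    }

    public static void main(String[] args) {
        // ipc layer: 10 attempts, 20s socket timeout, 1s sleep in between.
        long ipc = worstCaseSeconds(10, 20, 1);
        // hbase layer on top: 10 retries, each paying the full ipc cost plus a 2s pause.
        long stacked = worstCaseSeconds(10, ipc, 2);
        // SocketTimeoutException path: 45 hard-coded retries of a 20s timeout.
        long socketTimeout = worstCaseSeconds(45, 20, 0);

        System.out.println("ipc only:        " + ipc + "s");            // 210s
        System.out.println("ipc under hbase: " + stacked + "s");        // 2120s
        System.out.println("socket timeout:  " + socketTimeout + "s");  // 900s
    }
}
```

Under these assumptions the stacked case comes to roughly 35 minutes for a single call, which is the sort of number that motivates pulling retry out of ipc.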
In Justin's case, going by the stack trace, I don't think we were hitting SocketTimeoutException. It was more the 210 seconds per thread, but my guess is that his thrift client had probably timed out already.
This patch turns off retry down in the ipc client (letting the upper layers do the retrying), changes the hard-coded sleep times to use the hbase.client.pause time (2 seconds), and removes the hard-coded 45 retries. It also adds an hbase prefix to the ipc configuration parameters in case we want different values from hadoop.
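For reference, the shape of retry this leaves at the upper layer is roughly the following. This is only a sketch, not the code in the patch: ClientRetry and callWithRetries are invented names, and pauseMillis stands in for whatever hbase.client.pause is configured to.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration of retrying at the client level with a single
// configurable pause, instead of nested retries down in ipc.
public final class ClientRetry {
    private ClientRetry() {}

    /**
     * Runs the call up to maxAttempts times, sleeping pauseMillis between
     * failed attempts. Only IOExceptions are retried; anything else propagates.
     */
    public static <T> T callWithRetries(Callable<T> call, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                // No point sleeping after the final failed attempt.
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        // All attempts failed; surface the last failure to the caller.
        throw last;
    }
}
```

With ipc no longer retrying underneath, the total wait is bounded by one loop like this rather than by the product of two.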
Let me try out this patch. My guess is that there are places in hbase where we don't retry because we were dependent on ipc doing retry for us. Let me find those.