Description
Similar to HBASE-28050. If HMaster selects a RegionServer for SplitWalRemoteProcedure, it will retry this server as long as the server is alive. I believe this is because even though RSProcedureDispatcher.ExecuteProceduresRemoteCall.run calls remoteCallFailed, there is no logic after this to select a new target server. For TransitRegionStateProcedure there is logic to select a new server for opening a region, using forceNewPlan. But SplitWalRemoteProcedure only has logic to try another server if we receive a DoNotRetryIOException in SplitWALRemoteProcedure#complete: https://github.com/apache/hbase/blob/780ff56b3f23e7041ef1b705b7d3d0a53fdd05ae/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/SplitWALRemoteProcedure.java#L104-L110
If we receive any other IOException, we will just retry the target server forever. Just like in HBASE-28050, if there is a SaslException, this will never lead to retrying a SplitWalRemoteProcedure on a new server, which can lead to ServerCrashProcedure never finishing until the target server for SplitWalRemoteProcedure is restarted. The following log is seen repeatedly, always sending to the same host.
2024-01-31 15:59:43,616 WARN [RSProcedureDispatcher-pool-72846] procedure.SplitWALRemoteProcedure - Failed split of hdfs://<ns>/hbase/WALs/<host>,1704984571464-splitting/<host>1704984571464.1706710908543, retry... java.io.IOException: Call to address=<host> failed on local exception: java.io.IOException: Can not send request because relogin is in progress. at sun.reflect.GeneratedConstructorAccessor363.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:239) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420) at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114) at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129) at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:365) at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403) at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.IOException: Can not send request because relogin is in progress. at org.apache.hadoop.hbase.ipc.NettyRpcConnection.sendRequest0(NettyRpcConnection.java:321) at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:363) ... 8 more
Attachments
Issue Links
- relates to
-
HBASE-28050 RSProcedureDispatcher to fail-fast for krb auth failures
- Resolved