[HBASE-28422] SplitWalProcedure will attempt SplitWalRemoteProcedure on the same target RegionServer indefinitely - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.5.5
Fix Version/s: None
Component/s: master, proc-v2, wal
Labels: None

Description

Similar to HBASE-28050. Once HMaster selects a RegionServer for a SplitWALRemoteProcedure, it keeps retrying that server for as long as the server is alive. I believe this is because, although RSProcedureDispatcher.ExecuteProceduresRemoteCall.run calls remoteCallFailed, no logic after that call selects a new target server. TransitRegionStateProcedure can select a new server for opening a region by using forceNewPlan, but SplitWALRemoteProcedure only tries another server if it receives a DoNotRetryIOException in SplitWALRemoteProcedure#complete: https://github.com/apache/hbase/blob/780ff56b3f23e7041ef1b705b7d3d0a53fdd05ae/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/SplitWALRemoteProcedure.java#L104-L110

If we receive any other IOException, we retry the same target server forever. As in HBASE-28050, a SaslException therefore never causes the SplitWALRemoteProcedure to be retried on a new server, so the ServerCrashProcedure may not finish until the target server of the SplitWALRemoteProcedure is restarted. The following log appears repeatedly, and the request is always sent to the same host.

2024-01-31 15:59:43,616 WARN  [RSProcedureDispatcher-pool-72846] procedure.SplitWALRemoteProcedure - Failed split of hdfs://<ns>/hbase/WALs/<host>,1704984571464-splitting/<host>1704984571464.1706710908543, retry...
java.io.IOException: Call to address=<host> failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
	at sun.reflect.GeneratedConstructorAccessor363.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:239)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420)
	at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114)
	at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129)
	at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:365)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Can not send request because relogin is in progress.
	at org.apache.hadoop.hbase.ipc.NettyRpcConnection.sendRequest0(NettyRpcConnection.java:321)
	at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:363)
	... 8 more
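
The retry decision described above can be modeled with a short sketch. This is not HBase source code: the class SplitWalRetrySketch, its method names, the maxAttemptsPerServer parameter, and the local DoNotRetryIOException stand-in are all hypothetical and exist only for illustration. It contrasts the reported behavior (only a DoNotRetryIOException moves the work to another server) with one possible alternative, assumed here, that also gives up on a server after a bounded number of consecutive failures.

```java
import java.io.IOException;

/** Hypothetical model of the retry decision described in this issue; not HBase source code. */
final class SplitWalRetrySketch {

  /** Local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException so the sketch compiles alone. */
  static class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String message) {
      super(message);
    }
  }

  /** Reported behavior: only a DoNotRetryIOException moves the work to another server. */
  static boolean reportedPolicyPicksNewServer(IOException error) {
    return error instanceof DoNotRetryIOException;
  }

  /**
   * One possible alternative (an assumption, not a committed fix): also move on after
   * maxAttemptsPerServer consecutive failures against the same server.
   */
  static boolean boundedPolicyPicksNewServer(IOException error, int failedAttempts,
      int maxAttemptsPerServer) {
    return error instanceof DoNotRetryIOException || failedAttempts >= maxAttemptsPerServer;
  }

  /**
   * Counts attempts sent to one server that always fails with the given error.
   * Returns -1 if the policy is still retrying the same server after attemptLimit attempts.
   */
  static int attemptsUntilNewServer(IOException error, int maxAttemptsPerServer,
      int attemptLimit, boolean bounded) {
    for (int failed = 1; failed <= attemptLimit; failed++) {
      boolean switchServer = bounded
          ? boundedPolicyPicksNewServer(error, failed, maxAttemptsPerServer)
          : reportedPolicyPicksNewServer(error);
      if (switchServer) {
        return failed;
      }
    }
    return -1;
  }
}
```

With a plain IOException such as the "relogin is in progress" error in the log, the reported policy never switches servers within any attempt limit, while the bounded variant switches after maxAttemptsPerServer failures. Whether a bounded count, a time-based limit, or exception classification as in HBASE-28050 is the right remedy is left open here.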

Issue Links

relates to: HBASE-28050 RSProcedureDispatcher to fail-fast for krb auth failures (Resolved)

People

Assignee: Unassigned

Reporter: David Manning

Votes: 0

Watchers: 5

Dates

Created: 05/Mar/24 17:39

Updated: 06/Mar/24 13:46