Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
3.0, 3.0.0-beta1
None
Environment: A 3-node cluster running locally.
Docs Required, Release Notes Required
Description
Steps to reproduce:
1. Create a zone with the number of replicas equal to the number of nodes (2 or 3, respectively).
2. Create 10 tables in the zone.
3. Insert 100 rows into every table.
4. Wait until the local state of every table/partition/node combination is "HEALTHY".
5. Wait until the global state of every table/partition/node combination is "AVAILABLE".
6. Kill the first node with kill -9.
7. Create a new node and attach it to the cluster in place of the killed one.
8. Using the REST API, poll the physical topology until it contains only the 3 alive nodes.
9. Using the REST API, poll the logical topology until it contains only the 3 alive nodes.
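The last two steps (polling the physical and then the logical topology until only the alive nodes remain) can be sketched as a bounded polling loop. This is a minimal illustration, not the test suite's actual code: the class name TopologyWait, the method waitForNodes, and the fetchNodeNames supplier (which would wrap the REST topology call) are all hypothetical. A failed poll, such as a read timeout, is treated as "not ready yet" so that a single hanging request does not abort the wait.

```java
import java.time.Duration;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical helper: waits until a topology query reports exactly the expected nodes. */
public final class TopologyWait {
    private TopologyWait() {}

    /**
     * Polls fetchNodeNames until it returns exactly expectedNodes or the timeout expires.
     * fetchNodeNames is an assumption: it stands for any call that returns the node names
     * currently listed in the physical or logical topology.
     *
     * @return true if the expected topology was observed, false on timeout or interruption
     */
    public static boolean waitForNodes(Supplier<Set<String>> fetchNodeNames,
                                       Set<String> expectedNodes,
                                       Duration timeout,
                                       Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                if (expectedNodes.equals(fetchNodeNames.get())) {
                    return true;
                }
            } catch (RuntimeException e) {
                // A failed poll (for example a wrapped read timeout) means "not ready yet".
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // Overall time budget exhausted.
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt for the caller.
                return false;
            }
        }
    }
}
```

The overall timeout is kept separate from the per-request timeout on purpose: the failure in this ticket is a single request that never answers, so the loop has to bound total waiting time rather than rely on each call returning.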
Expected:
Data is returned.
Actual:
At step 9, the request hangs and then throws:
org.gridgain.ai3tests.core.generated.restapi.invoker.ApiException: Message: java.net.SocketTimeoutException: timeout
HTTP response code: 0
HTTP response body: null
HTTP response headers: null
    at org.gridgain.ai3tests.core.generated.restapi.invoker.ApiClient.execute(ApiClient.java:1047)
    at org.gridgain.ai3tests.core.generated.restapi.api.TopologyApi.logicalWithHttpInfo(TopologyApi.java:174)
    at org.gridgain.ai3tests.core.generated.restapi.api.TopologyApi.logical(TopologyApi.java:154)
    at org.gridgain.ai3tests.core.ignite.topology.TopologyUtils.getTopology(TopologyUtils.java:121)
    at org.gridgain.ai3tests.core.ignite.topology.TopologyUtils.lambda$waitForTopology$0(TopologyUtils.java:74)
    at org.gridgain.ai3tests.core.utils.RetryUtils.retryOnAllowedException(RetryUtils.java:40)
    at org.gridgain.ai3tests.core.ignite.topology.TopologyUtils.waitForTopology(TopologyUtils.java:72)
    at org.gridgain.ai3tests.core.ignite.topology.TopologyUtils.waitForLogicalTopology(TopologyUtils.java:56)
    at org.gridgain.ai3tests.tests.failover.ClusterFailover3NodesTest.killNodeAndReplaceWithNewEmptyOne(ClusterFailover3NodesTest.java:155)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketTimeoutException: timeout
    at okio.SocketAsyncTimeout.newTimeoutException(JvmOkio.kt:146)
    at okio.AsyncTimeout.access$newTimeoutException(AsyncTimeout.kt:161)
    at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:339)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.kt:430)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:323)
    at okhttp3.internal.http1.HeadersReader.readLine(HeadersReader.kt:29)
    at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:180)
    at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:110)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at org.gridgain.ai3tests.core.generated.restapi.invoker.ApiClient$2.intercept(ApiClient.java:1457)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
    at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154)
    at org.gridgain.ai3tests.core.generated.restapi.invoker.ApiClient.execute(ApiClient.java:1043)
    ... 13 more
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.base/java.net.SocketInputStream.socketRead0(Native Method)
    at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at okio.InputStreamSource.read(JvmOkio.kt:93)
    at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:128)
    ... 33 more
The server logs show continuous errors:
2024-05-30 10:51:37:069 +0200 [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-9][AbstractClientService] Fail to connect ClusterFailover3NodesTest_cluster_0, exception: java.net.ConnectException.
2024-05-30 10:51:37:069 +0200 [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-9][ReplicatorGroupImpl] Fail to check replicator connection to peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower.
2024-05-30 10:51:37:069 +0200 [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-15][AbstractClientService] Fail to connect ClusterFailover3NodesTest_cluster_0, exception: java.net.ConnectException.
2024-05-30 10:51:37:069 +0200 [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-15][ReplicatorGroupImpl] Fail to check replicator connection to peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower.
2024-05-30 10:51:37:069 +0200 [WARNING][%ClusterFailover3NodesTest_cluster_1%Raft-Group-Client-6][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=ReadActionRequestImpl [command=GetCommandImpl [key=[97, 115, 115, 105, 103, 110, 109, 101, 110, 116, 115, 46, 112, 101, 110, 100, 105, 110, 103, 46, 50, 54, 95, 112, 97, 114, 116, 95, 56], revision=-1], groupId=metastorage_group, readOnlySafe=true], peer=Peer [consistentId=ClusterFailover3NodesTest_cluster_0, idx=0], newPeer=Peer [consistentId=ClusterFailover3NodesTest_cluster_0, idx=0]].
java.util.concurrent.CompletionException: java.net.ConnectException: Peer ClusterFailover3NodesTest_cluster_0 is unavailable
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
    at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
    at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:558)
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleThrowable$41(RaftGroupServiceImpl.java:605)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.ConnectException: Peer ClusterFailover3NodesTest_cluster_0 is unavailable
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:806)
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:557)
    ... 7 more
2024-05-30 10:51:37:069 +0200 [WARNING][%ClusterFailover3NodesTest_cluster_1%Raft-Group-Client-11][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=ReadActionRequestImpl [command=GetCommandImpl [key=[97, 115, 115, 105, 103, 110, 109, 101, 110, 116, 115, 46, 112, 101, 110, 100, 105, 110, 103, 46, 49, 56, 95, 112, 97, 114, 116, 95, 49, 48], revision=-1], groupId=metastorage_group, readOnlySafe=true], peer=Peer [consistentId=ClusterFailover3NodesTest_cluster_0, idx=0], newPeer=Peer [consistentId=ClusterFailover3NodesTest_cluster_0, idx=0]].
java.util.concurrent.CompletionException: java.net.ConnectException: Peer ClusterFailover3NodesTest_cluster_0 is unavailable
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
    at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
    at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:558)
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleThrowable$41(RaftGroupServiceImpl.java:605)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.ConnectException: Peer ClusterFailover3NodesTest_cluster_0 is unavailable
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:806)
    at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:557)
    ... 7 more
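The key= field in the WARNING entries is logged as a list of byte values, which hides what the retried reads are actually asking for. A small sketch for decoding such a list into text, assuming the bytes are UTF-8 (the class name LoggedKeyDecoder is hypothetical and not part of Ignite):

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: turns a logged byte list such as "[97, 115, 115]" into a string. */
public final class LoggedKeyDecoder {
    private LoggedKeyDecoder() {}

    /** Decodes a bracketed, comma-separated list of signed byte values as UTF-8 text. */
    public static String decode(String loggedKey) {
        String body = loggedKey.trim();
        // Drop the surrounding brackets if they are present.
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1).trim();
        }
        if (body.isEmpty()) {
            return "";
        }
        String[] parts = body.split(",");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // Java logs bytes as signed values (-128..127), which Byte.parseByte accepts.
            bytes[i] = Byte.parseByte(parts[i].trim());
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Applied to the two keys above, decode returns assignments.pending.26_part_8 and assignments.pending.18_part_10. Both requests are therefore reads of pending-assignment entries from the metastorage_group, each sent to the killed peer ClusterFailover3NodesTest_cluster_0.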
Issue Links
- is part of: IGNITE-22187 Cluster of 2 or 3 nodes doesn't work if one node is down (Resolved)
- is superseded by: IGNITE-22517 Partitions stay DEGRADED after node replacement in 3 node cluster (Open)