  1. Ignite
  2. IGNITE-22376

REST API request for the logical topology hangs after one node is replaced in a 3-node cluster

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • 3.0, 3.0.0-beta1
    • None
    • Environment: A 3-node cluster running locally.

    • Docs Required, Release Notes Required

    Description

      Steps to reproduce:

      1. Create a zone with the replica count equal to the number of nodes (2 or 3, respectively).
      2. Create 10 tables inside the zone.
      3. Insert 100 rows into every table.
      4. Wait until the local state is "HEALTHY" for every table, partition, and node.
      5. Wait until the global state is "AVAILABLE" for every table, partition, and node.
      6. Kill the first node with kill -9.
      7. Create a new node and attach it to the cluster in place of the killed one.
      8. Using the REST API, poll the physical topology until it contains exactly 3 live nodes.
      9. Using the REST API, poll the logical topology until it contains exactly 3 live nodes.
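
Steps 8 and 9 can be sketched as a bounded polling loop. This is a minimal sketch, not the test framework code from the stack trace below: the class TopologyPoller and its methods are hypothetical names, and the endpoint path /management/v1/cluster/topology/logical and the base URL http://localhost:10300 are assumptions that this issue does not state. A per-request timeout is set so that a hanging call fails fast and is retried until the overall deadline.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, not part of Ignite or of the test framework in this issue.
class TopologyPoller {
    /**
     * Calls fetch repeatedly until accept returns true or the deadline passes.
     * An exception from fetch (for example a request timeout) counts as a failed attempt.
     */
    static <T> boolean pollUntil(Callable<T> fetch, Predicate<T> accept,
                                 Duration deadline, Duration pause) throws InterruptedException {
        long end = System.nanoTime() + deadline.toNanos();
        while (true) {
            try {
                if (accept.test(fetch.call())) {
                    return true;
                }
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                // Failed attempt: fall through and retry until the deadline.
            }
            if (end - System.nanoTime() <= 0) {
                return false;
            }
            Thread.sleep(pause.toMillis());
        }
    }

    /** Builds a GET request for the (assumed) logical topology endpoint. */
    static HttpRequest logicalTopologyRequest(String baseUrl, Duration requestTimeout) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/management/v1/cluster/topology/logical"))
                .timeout(requestTimeout) // bounds a single hanging request
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body; throws on a non-200 status. */
    static String fetchLogicalTopology(HttpClient client, String baseUrl,
                                       Duration requestTimeout) throws Exception {
        HttpResponse<String> response = client.send(
                logicalTopologyRequest(baseUrl, requestTimeout),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

A caller would pass a lambda that invokes fetchLogicalTopology as the fetch argument, plus a predicate that parses the response body and checks that it lists exactly 3 nodes. The parsing is left out here because the response schema is not shown in this issue.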

      Expected:

      Data is returned.

      Actual:
      At step 9 the request hangs and then throws:

      org.gridgain.ai3tests.core.generated.restapi.invoker.ApiException: Message: java.net.SocketTimeoutException: timeout
      HTTP response code: 0
      HTTP response body: null
      HTTP response headers: null
          at org.gridgain.ai3tests.core.generated.restapi.invoker.ApiClient.execute(ApiClient.java:1047)
          at org.gridgain.ai3tests.core.generated.restapi.api.TopologyApi.logicalWithHttpInfo(TopologyApi.java:174)
          at org.gridgain.ai3tests.core.generated.restapi.api.TopologyApi.logical(TopologyApi.java:154)
          at org.gridgain.ai3tests.core.ignite.topology.TopologyUtils.getTopology(TopologyUtils.java:121)
          at org.gridgain.ai3tests.core.ignite.topology.TopologyUtils.lambda$waitForTopology$0(TopologyUtils.java:74)
          at org.gridgain.ai3tests.core.utils.RetryUtils.retryOnAllowedException(RetryUtils.java:40)
          at org.gridgain.ai3tests.core.ignite.topology.TopologyUtils.waitForTopology(TopologyUtils.java:72)
          at org.gridgain.ai3tests.core.ignite.topology.TopologyUtils.waitForLogicalTopology(TopologyUtils.java:56)
          at org.gridgain.ai3tests.tests.failover.ClusterFailover3NodesTest.killNodeAndReplaceWithNewEmptyOne(ClusterFailover3NodesTest.java:155)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.net.SocketTimeoutException: timeout
          at okio.SocketAsyncTimeout.newTimeoutException(JvmOkio.kt:146)
          at okio.AsyncTimeout.access$newTimeoutException(AsyncTimeout.kt:161)
          at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:339)
          at okio.RealBufferedSource.indexOf(RealBufferedSource.kt:430)
          at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:323)
          at okhttp3.internal.http1.HeadersReader.readLine(HeadersReader.kt:29)
          at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:180)
          at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:110)
          at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:93)
          at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
          at org.gridgain.ai3tests.core.generated.restapi.invoker.ApiClient$2.intercept(ApiClient.java:1457)
          at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
          at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
          at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
          at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
          at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
          at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
          at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
          at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
          at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
          at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
          at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154)
          at org.gridgain.ai3tests.core.generated.restapi.invoker.ApiClient.execute(ApiClient.java:1043)
          ... 13 more
      Caused by: java.net.SocketTimeoutException: Read timed out
          at java.base/java.net.SocketInputStream.socketRead0(Native Method)
          at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
          at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
          at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
          at okio.InputStreamSource.read(JvmOkio.kt:93)
          at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:128)
          ... 33 more
      

      The server logs contain continuous errors:

      2024-05-30 10:51:37:069 +0200 [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-9][AbstractClientService] Fail to connect ClusterFailover3NodesTest_cluster_0, exception: java.net.ConnectException.
      2024-05-30 10:51:37:069 +0200 [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-9][ReplicatorGroupImpl] Fail to check replicator connection to peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower.
      2024-05-30 10:51:37:069 +0200 [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-15][AbstractClientService] Fail to connect ClusterFailover3NodesTest_cluster_0, exception: java.net.ConnectException.
      2024-05-30 10:51:37:069 +0200 [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-15][ReplicatorGroupImpl] Fail to check replicator connection to peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower.
      2024-05-30 10:51:37:069 +0200 [WARNING][%ClusterFailover3NodesTest_cluster_1%Raft-Group-Client-6][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=ReadActionRequestImpl [command=GetCommandImpl [key=[97, 115, 115, 105, 103, 110, 109, 101, 110, 116, 115, 46, 112, 101, 110, 100, 105, 110, 103, 46, 50, 54, 95, 112, 97, 114, 116, 95, 56], revision=-1], groupId=metastorage_group, readOnlySafe=true], peer=Peer [consistentId=ClusterFailover3NodesTest_cluster_0, idx=0], newPeer=Peer [consistentId=ClusterFailover3NodesTest_cluster_0, idx=0]].
      java.util.concurrent.CompletionException: java.net.ConnectException: Peer ClusterFailover3NodesTest_cluster_0 is unavailable
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
        at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
        at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:558)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleThrowable$41(RaftGroupServiceImpl.java:605)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.net.ConnectException: Peer ClusterFailover3NodesTest_cluster_0 is unavailable
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:806)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:557)
        ... 7 more
      2024-05-30 10:51:37:069 +0200 [WARNING][%ClusterFailover3NodesTest_cluster_1%Raft-Group-Client-11][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=ReadActionRequestImpl [command=GetCommandImpl [key=[97, 115, 115, 105, 103, 110, 109, 101, 110, 116, 115, 46, 112, 101, 110, 100, 105, 110, 103, 46, 49, 56, 95, 112, 97, 114, 116, 95, 49, 48], revision=-1], groupId=metastorage_group, readOnlySafe=true], peer=Peer [consistentId=ClusterFailover3NodesTest_cluster_0, idx=0], newPeer=Peer [consistentId=ClusterFailover3NodesTest_cluster_0, idx=0]].
      java.util.concurrent.CompletionException: java.net.ConnectException: Peer ClusterFailover3NodesTest_cluster_0 is unavailable
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
        at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
        at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:558)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleThrowable$41(RaftGroupServiceImpl.java:605)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.net.ConnectException: Peer ClusterFailover3NodesTest_cluster_0 is unavailable
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:806)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:557)
        ... 7 more
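
The key field in the GetCommandImpl entries above is logged as a list of byte values, which hides which metastorage key is being read. The sketch below decodes such a list into text, assuming the bytes are a UTF-8 string (an assumption, although the values here are all plain ASCII). MetastorageKeyDecoder is a hypothetical helper name, not an Ignite class.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for reading logged keys; not part of Ignite.
class MetastorageKeyDecoder {
    /** Converts a logged byte list such as "[97, 115, 115]" into the string it encodes. */
    static String decode(String logged) {
        String[] parts = logged.replace("[", "").replace("]", "").split(",");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = Byte.parseByte(parts[i].trim());
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Key copied from the first WARNING entry above.
        System.out.println(decode("[97, 115, 115, 105, 103, 110, 109, 101, 110, 116, 115, 46, "
                + "112, 101, 110, 100, 105, 110, 103, 46, 50, 54, 95, 112, 97, 114, 116, 95, 56]"));
        // prints "assignments.pending.26_part_8"
    }
}
```

Decoded the same way, the key in the second WARNING entry is assignments.pending.18_part_10, so both retried reads target pending assignment keys in the metastorage group.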
      

            People

              Assignee: Aleksandr Polovtsev (apolovtcev)
              Reporter: Igor (lunigorn)