  1. Ignite
  2. IGNITE-22514

Failed to get the primary replica if the non-CMG node is down

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Not A Problem
    • 3.0
    • None
    • A 2-node cluster (1 CMG node).

    • Docs Required, Release Notes Required

    Description

      Steps to reproduce:

      1. Start a 2-node cluster with one CMG node.
      2. Create a zone whose replica count equals the number of nodes (2).
      3. Create 10 tables in the zone.
      4. Insert 100 rows into every table.
      5. Wait until the local state of every table, partition, and node combination is "HEALTHY".
      6. Wait until the global state of every table, partition, and node combination is "AVAILABLE".
      7. Kill the non-CMG node with kill -9.
      8. Assert that the physical topology contains only 1 alive node.
      9. Assert that the logical topology contains only 1 alive node.
      10. Wait until the local state of every table, partition, and node combination is "HEALTHY".
      11. Wait until the global state of every table, partition, and node combination is "READ_ONLY".
      12. Execute a select query over JDBC, connecting to the alive CMG node.
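
The "await" steps (5, 6, 10 and 11) amount to polling a state check until it passes or a timeout expires. A minimal sketch of such a helper is below. The class name `Await`, the method `until`, and the way the partition state is read are all hypothetical; this is not the helper used by the test framework in the stack trace, and the report does not say how the states are queried.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper for the "await" steps; not part of Ignite. */
final class Await {
    private Await() {
    }

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean until(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A caller would pass a supplier that checks, for example, whether every partition reports "READ_ONLY", and fail the test if `until` returns false.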

      Expected:

      Data is returned.

      Actual:

      Step 12 fails with the following exception:

      Failed to get the primary replica [tablePartitionId=10_part_1]
      java.sql.SQLException: Failed to get the primary replica [tablePartitionId=10_part_1]
          at org.apache.ignite.internal.jdbc.proto.IgniteQueryErrorCode.createJdbcSqlException(IgniteQueryErrorCode.java:57)
          at org.apache.ignite.internal.jdbc.JdbcStatement.execute0(JdbcStatement.java:154)
          at org.apache.ignite.internal.jdbc.JdbcStatement.executeQuery(JdbcStatement.java:111)
          at org.gridgain.ai3tests.tests.teststeps.JdbcSteps.executeQuery(JdbcSteps.java:91)
          at org.gridgain.ai3tests.tests.failover.ClusterFailoverTestBase.getActualResult(ClusterFailoverTestBase.java:338)
          at org.gridgain.ai3tests.tests.failover.ClusterFailoverTestBase.assertDataIsFilledWithoutErrors(ClusterFailoverTestBase.java:169)
          at org.gridgain.ai3tests.tests.failover.ClusterFailover2NodesTest.singleKillAndCheckOtherNodeWorks(ClusterFailover2NodesTest.java:123)
          at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
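
To reproduce step 12 outside the test framework, a plain JDBC query against the surviving node should be enough. The sketch below rests on several assumptions: the URL scheme `jdbc:ignite:thin://host:port` is assumed to be the one used by the Ignite 3 JDBC driver (which must be on the classpath), the host and port are placeholders, and the `COUNT(*)` query and the table name passed in are hypothetical, since the report does not state which select was run.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical stand-alone reproducer for step 12. */
final class SelectAfterFailover {
    private SelectAfterFailover() {
    }

    /** Builds the JDBC URL; the scheme is an assumption about the Ignite 3 driver. */
    static String jdbcUrl(String host, int port) {
        return "jdbc:ignite:thin://" + host + ":" + port;
    }

    /**
     * Runs a select against the given node and returns the row count.
     * The table name comes from the caller and is not validated here.
     */
    static long countRows(String url, String table) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```

In the scenario described, `countRows` is the call that would be expected to throw the `SQLException` shown above.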

      The server logs show continuous errors:

      2024-06-14 18:10:58:719 +0000 [WARNING][%ClusterFailover2NodesTest_cluster_0%Raft-Group-Client-7][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=ReadIndexRequestImpl [entriesList=null, groupId=28_part_1, peerId=ClusterFailover2NodesTest_cluster_1, serverId=ClusterFailover2NodesTest_cluster_1], peer=Peer [consistentId=ClusterFailover2NodesTest_cluster_1, idx=0], newPeer=Peer [consistentId=ClusterFailover2NodesTest_cluster_1, idx=0]].
      java.util.concurrent.CompletionException: java.net.ConnectException: Peer ClusterFailover2NodesTest_cluster_1 is unavailable
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
        at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
        at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:558)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$44(RaftGroupServiceImpl.java:653)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.net.ConnectException: Peer ClusterFailover2NodesTest_cluster_1 is unavailable
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:806)
        at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:557)
        ... 7 more

      Server logs are in the attachments.

      Attachments

        1. CMG node.zip
          3.83 MB
          Igor
        2. non CMG killed node.zip
          54 kB
          Igor

            People

              Assignee: Unassigned
              Reporter: Igor (lunigorn)