[IGNITE-22514] Failed to get the primary replica if non CMG node is down - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 3.0
Fix Version/s: None
Component/s: general, jdbc, networking, persistence
Labels:
- ignite-3
Environment:

A 2-node cluster (1 CMG node).

Ignite Flags:

Docs Required, Release Notes Required

Description

Steps to reproduce:

1. Start a cluster of 2 nodes with one CMG node.
2. Create a zone with the replica count equal to the number of nodes (2).
3. Create 10 tables inside the zone.
4. Insert 100 rows into every table.
5. Wait until the local state is "HEALTHY" for every table, partition, and node combination.
6. Wait until the global state is "AVAILABLE" for every table, partition, and node combination.
7. Kill the non-CMG node with kill -9.
8. Assert that the physical topology contains only 1 alive node.
9. Assert that the logical topology contains only 1 alive node.
10. Wait until the local state is "HEALTHY" for every table, partition, and node combination.
11. Wait until the global state is "READ_ONLY" for every table, partition, and node combination.
12. Execute a select query over JDBC, connecting to the alive CMG node.
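
The setup (steps 2-4), the waits, and the final query (step 12) could be sketched in Java roughly as follows. This is a minimal sketch under assumptions, not the actual test code: the class name, zone and table names, and column layout are hypothetical, the zone and table DDL syntax is assumed and has changed between Ignite 3 builds (some builds also require a storage profile), and the JDBC URL assumes the surviving node listens on localhost at the default client port 10800. How the partition states are read is not stated in this issue, so the wait helper takes an arbitrary condition.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class ReproSketch {
    // Hypothetical zone name; the issue does not state it.
    static final String ZONE = "TEST_ZONE";

    /** Builds the statements for steps 2-4. The DDL syntax is an assumption. */
    static List<String> setupStatements(int replicas, int tables, int rowsPerTable) {
        List<String> sql = new ArrayList<>();
        sql.add("CREATE ZONE " + ZONE + " WITH REPLICAS=" + replicas);
        for (int t = 0; t < tables; t++) {
            String table = "TEST_TABLE_" + t; // hypothetical table name
            sql.add("CREATE TABLE " + table
                    + " (ID INT PRIMARY KEY, VAL VARCHAR) WITH PRIMARY_ZONE='" + ZONE + "'");
            for (int r = 0; r < rowsPerTable; r++) {
                sql.add("INSERT INTO " + table + " (ID, VAL) VALUES (" + r + ", 'v" + r + "')");
            }
        }
        return sql;
    }

    /** Polls a condition until it holds or the timeout expires (the "wait until" steps). */
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    /** Step 12: runs a select against one table and returns the row count. */
    static long countRows(Connection conn, String table) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws SQLException {
        // Assumed address of the surviving CMG node; requires the Ignite 3 JDBC driver
        // on the classpath and a running cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1:10800")) {
            System.out.println(countRows(conn, "TEST_TABLE_0"));
        }
    }
}
```

In the scenario described here, the `executeQuery` call in `countRows` is where the `SQLException` would surface.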

Expected:

Data is returned.

Actual:

An exception occurs at step 12:

Failed to get the primary replica [tablePartitionId=10_part_1]
java.sql.SQLException: Failed to get the primary replica [tablePartitionId=10_part_1]
    at org.apache.ignite.internal.jdbc.proto.IgniteQueryErrorCode.createJdbcSqlException(IgniteQueryErrorCode.java:57)
    at org.apache.ignite.internal.jdbc.JdbcStatement.execute0(JdbcStatement.java:154)
    at org.apache.ignite.internal.jdbc.JdbcStatement.executeQuery(JdbcStatement.java:111)
    at org.gridgain.ai3tests.tests.teststeps.JdbcSteps.executeQuery(JdbcSteps.java:91)
    at org.gridgain.ai3tests.tests.failover.ClusterFailoverTestBase.getActualResult(ClusterFailoverTestBase.java:338)
    at org.gridgain.ai3tests.tests.failover.ClusterFailoverTestBase.assertDataIsFilledWithoutErrors(ClusterFailoverTestBase.java:169)
    at org.gridgain.ai3tests.tests.failover.ClusterFailover2NodesTest.singleKillAndCheckOtherNodeWorks(ClusterFailover2NodesTest.java:123)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

The server logs contain continuous errors:

2024-06-14 18:10:58:719 +0000 [WARNING][%ClusterFailover2NodesTest_cluster_0%Raft-Group-Client-7][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=ReadIndexRequestImpl [entriesList=null, groupId=28_part_1, peerId=ClusterFailover2NodesTest_cluster_1, serverId=ClusterFailover2NodesTest_cluster_1], peer=Peer [consistentId=ClusterFailover2NodesTest_cluster_1, idx=0], newPeer=Peer [consistentId=ClusterFailover2NodesTest_cluster_1, idx=0]].
java.util.concurrent.CompletionException: java.net.ConnectException: Peer ClusterFailover2NodesTest_cluster_1 is unavailable
  at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
  at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1099)
  at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
  at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:558)
  at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$handleErrorResponse$44(RaftGroupServiceImpl.java:653)
  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.ConnectException: Peer ClusterFailover2NodesTest_cluster_1 is unavailable
  at org.apache.ignite.internal.raft.RaftGroupServiceImpl.resolvePeer(RaftGroupServiceImpl.java:806)
  at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:557)
  ... 7 more

Server logs are in the attachments.

Attachments

CMG node.zip
14/Jun/24 18:43
3.83 MB
Igor
non CMG killed node.zip
14/Jun/24 18:43
54 kB
Igor

Issue Links

is superseded by

IGNITE-23087 Wrong partitions status if 1 node of 2 nodes cluster is down

Open

supersedes

IGNITE-22187 Cluster of 2 or 3 nodes doesn't work if one node is down

Resolved
