Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-22424

Insert and read errors under high cluster load

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Cannot Reproduce
    • None
    • None
    • None
    • Docs Required, Release Notes Required

    Description

      Setup

      • 1 or 3 server nodes cluster
      • 3 client nodes
      • 64+ threads per client performing KV put/get
      • Each node (server or client) is an AWS instance of c5d.4xlarge

      Steps

      • Client nodes do put 15m records
      • Client nodes do get 15m records

      Expected result

      No errors occurred, all records inserted and read after successfully.

      Actual result

      On some point, cluster degrades and throws errors on inserting/reading.

      Error examples:

      • The primary replica has changed
      org.apache.ignite.lang.IgniteException: The primary replica has changed [expectedLeaseholderName=poc-tester-SERVER-192.168.1.58-id-0, currentLeaseholderName=null, expectedLeaseholderId=b4425409-2d71-482d-a5d6-468522569cd0, currentLeaseholderId=null, expectedEnlistmentConsistencyToken=112569527664115719, currentEnlistmentConsistencyToken=null]
      	at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710) ~[?:?]
      	at org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:789) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:723) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:525) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.ClientUtils.copyExceptionWithCauseIfPossible(ClientUtils.java:73) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.ClientUtils.ensurePublicException(ClientUtils.java:54) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.ClientUtils.sync(ClientUtils.java:97) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.get(ClientKeyValueBinaryView.java:78) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.get(ClientKeyValueBinaryView.java:59) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at site.ycsb.db.ignite3.IgniteClient.read(IgniteClient.java:90) [ignite3-binding-2024.11.jar:?]
      	at site.ycsb.DBWrapper.read(DBWrapper.java:157) [core-2024.11.jar:?]
      	at site.ycsb.workloads.CoreWorkload.doTransactionRead(CoreWorkload.java:803) [core-2024.11.jar:?]
      	at site.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:722) [core-2024.11.jar:?]
      	at site.ycsb.ClientThread.run(ClientThread.java:145) [core-2024.11.jar:?]
      	at java.lang.Thread.run(Thread.java:829) [?:?] 

       

      • Replication is timed out
      org.apache.ignite.tx.TransactionException: Replication is timed out [replicaGrpId=10_part_4]
      	at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710) ~[?:?]
      	at org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:789) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:723) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:525) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.ClientUtils.copyExceptionWithCauseIfPossible(ClientUtils.java:73) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.ClientUtils.ensurePublicException(ClientUtils.java:54) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.ClientUtils.sync(ClientUtils.java:97) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.get(ClientKeyValueBinaryView.java:78) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.get(ClientKeyValueBinaryView.java:59) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
      	at site.ycsb.db.ignite3.IgniteClient.read(IgniteClient.java:90) [ignite3-binding-2024.11.jar:?]
      	at site.ycsb.DBWrapper.read(DBWrapper.java:157) [core-2024.11.jar:?]
      	at site.ycsb.workloads.CoreWorkload.doTransactionRead(CoreWorkload.java:803) [core-2024.11.jar:?]
      	at site.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:722) [core-2024.11.jar:?]
      	at site.ycsb.ClientThread.run(ClientThread.java:145) [core-2024.11.jar:?]
      	at java.lang.Thread.run(Thread.java:829) [?:?] 

       

      • A critical thread is blocked for XXX ms that is more than the allowed 500 ms
      2024-06-04 22:26:49:031 +0300 [ERROR][%poc-tester-SERVER-192.168.1.119-id-0%common-scheduler-0][CriticalWorkerWatchdog] A critical thread is blocked for 512 ms that is more than the allowed 500 ms, it is "poc-tester-SERVER-192.168.1.119-id-0-client-32" prio=10 Id=122 RUNNABLE (in native)
          at java.base@11.0.12/sun.nio.ch.EPoll.wait(Native Method)
          at java.base@11.0.12/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
          at java.base@11.0.12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
          -  locked io.netty.channel.nio.SelectedSelectionKeySet@71be3d25
          -  locked sun.nio.ch.EPollSelectorImpl@5aebdd7
          at java.base@11.0.12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:141)
          at app//io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
          at app//io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:879)
          at app//io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:526)
          at app//io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
          at app//io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
          at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.base@11.0.12/java.lang.Thread.run(Thread.java:829)

       

      • Recoverable error during the request occurred (will be retried on the randomly selected node)
      2024-06-04 22:26:49:306 +0300 [WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=WriteActionRequestImpl [command=[0, 9, 41, -117, -128, -40, -77, -3, -106, -7, -57, 1], deserializedCommand=SafeTimeSyncCommandImpl [safeTimeLong=112559993935757322], groupId=10_part_10], peer=Peer [consistentId=poc-tester-SERVER-192.168.1.119-id-0, idx=0], newPeer=Peer [consistentId=poc-tester-SERVER-192.168.1.119-id-0, idx=0]].
      java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
          at java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
          at java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
          at java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
          at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
          at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
          at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.util.concurrent.TimeoutException 

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              n.e.sivkov Nikita Sivkov
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 10m
                  10m