Description
After the cluster restart you may see the following assertion:
java.lang.AssertionError: LWM after reserved: lwm=2030, reserved=2010, cntr=Counter [lwm=2030, missed=[], hwm=2030, reserved=2011] at org.apache.ignite.internal.processors.cache.PartitionUpdateCounterTrackingImpl.reserve(PartitionUpdateCounterTrackingImpl.java:270) at org.apache.ignite.internal.processors.cache.PartitionUpdateCounterErrorWrapper.reserve(PartitionUpdateCounterErrorWrapper.java:58) at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.getAndIncrementUpdateCounter(IgniteCacheOffheapManagerImpl.java:1594) at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.getAndIncrementUpdateCounter(GridCacheOffheapManager.java:2483) at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtLocalPartition.getAndIncrementUpdateCounter(GridDhtLocalPartition.java:942) at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.calculatePartitionUpdateCounters(IgniteTxLocalAdapter.java:510) at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareFuture.prepare0(GridDhtTxPrepareFuture.java:1356) at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareFuture.mapIfLocked(GridDhtTxPrepareFuture.java:726) at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareFuture.prepare(GridDhtTxPrepareFuture.java:1132) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.prepareAsyncLocal(GridNearTxLocal.java:4282) at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.prepareColocatedTx(IgniteTxHandler.java:303) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFuture.proceedPrepare(GridNearOptimisticTxPrepareFuture.java:565) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFuture.prepareSingle(GridNearOptimisticTxPrepareFuture.java:392) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFuture.prepare0(GridNearOptimisticTxPrepareFuture.java:335) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFutureAdapter.prepareOnTopology(GridNearOptimisticTxPrepareFutureAdapter.java:205) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearOptimisticTxPrepareFutureAdapter.prepare(GridNearOptimisticTxPrepareFutureAdapter.java:129) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.prepareNearTxLocal(GridNearTxLocal.java:3946) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.commitNearTxLocalAsync(GridNearTxLocal.java:3994) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.optimisticPutFuture(GridNearTxLocal.java:3051) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.putAsync0(GridNearTxLocal.java:729) at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.putAsync(GridNearTxLocal.java:484) at org.apache.ignite.internal.processors.cache.GridCacheAdapter$20.op(GridCacheAdapter.java:2511) at org.apache.ignite.internal.processors.cache.GridCacheAdapter$20.op(GridCacheAdapter.java:2509) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:4284) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put0(GridCacheAdapter.java:2509) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2487) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2466) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.put(IgniteCacheProxyImpl.java:1332) at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.put(GatewayProtectedCacheProxy.java:867) at org.apache.ignite.util.BrokenRebalanceTest.testCountersOnCrachRecovery(BrokenRebalanceTest.java:191) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.apache.ignite.testframework.junits.GridAbstractTest$6.run(GridAbstractTest.java:2506)
when clusted was under the load and has a counters gaps on restart.
Possible solution is to fix PartitionUpdateCounterTrackingImpl#reservedCntr calculation
from
long max = Math.max(val, curLwm);
to something like
long max = Math.max(val, curLwm); max = Math.max(max, highestAppliedCounter());
Another possible fix is to get rid or PartitionUpdateCounterTrackingImpl#reservedCntr and always use highestAppliedCounter().
This may slowdown the system, but look like a correct fix. So, some optimisation may be required.
Please make sure that counters are correct after the fix using idle_verify.
Attachments
Attachments
Issue Links
- Dependency
-
IGNITE-17793 Historical rebalance must use HWM instead of LWM to seek the proper checkpoint to avoid the data loss
- Resolved
- is related to
-
IGNITE-17496 LWM may be after HWM (reserved) on primary after the node restart
- Resolved
- links to