Description
During cluster deactivation we force a checkpoint (with the "caches stop" reason) and remove the checkpoint listeners before the caches are actually stopped. If data pages are modified on the node after that checkpoint but before the caches stop, and the next checkpoint starts within that window, the storage can be corrupted.
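To make the window easier to picture, here is a toy model in plain Java. It is only a sketch: `ToyStore`, `Listener` and `meta.rowCount` are hypothetical names, not Ignite classes, and it assumes that the removed listeners are what persist metadata alongside the data pages. The ticket itself does not state the exact corruption mechanism, so treat this as one plausible illustration of the ordering problem.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types exist in Ignite. */
class DeactivationRaceSketch {
    /** Hypothetical stand-in for a checkpoint listener that persists metadata. */
    interface Listener {
        void onCheckpoint(Map<String, Integer> disk);
    }

    static final class ToyStore {
        final Map<String, Integer> disk = new HashMap<>();
        final Map<String, Integer> dirtyPages = new HashMap<>();
        final List<Listener> listeners = new ArrayList<>();

        /** In-memory metadata; it reaches the disk only through the listener. */
        int rowCount;

        ToyStore() {
            listeners.add(d -> d.put("meta.rowCount", rowCount));
        }

        void put(int key, int val) {
            String page = "page." + key;

            if (!disk.containsKey(page) && !dirtyPages.containsKey(page))
                rowCount++;

            dirtyPages.put(page, val);
        }

        void checkpoint() {
            // Listeners flush metadata first, then dirty pages are written.
            for (Listener l : listeners)
                l.onCheckpoint(disk);

            disk.putAll(dirtyPages);
            dirtyPages.clear();
        }

        boolean consistentOnDisk() {
            int pages = 0;

            for (String k : disk.keySet()) {
                if (k.startsWith("page."))
                    pages++;
            }

            int meta = disk.getOrDefault("meta.rowCount", 0);

            return pages == meta;
        }
    }

    /** Replays the deactivation order described above on the toy store. */
    static boolean run(boolean writeAfterFinalCheckpoint) {
        ToyStore s = new ToyStore();

        s.put(1, 1);
        s.checkpoint();      // Forced "caches stop" checkpoint.
        s.listeners.clear(); // Listeners removed before the caches stop.

        if (writeAfterFinalCheckpoint)
            s.put(2, 2);     // Late activity on data pages.

        s.checkpoint();      // Next checkpoint runs without listeners.

        return s.consistentOnDisk();
    }

    public static void main(String[] args) {
        System.out.println("no late write: consistent=" + run(false));
        System.out.println("late write:    consistent=" + run(true));
    }
}
```

In this model the on-disk state stays consistent only when nothing touches pages after the forced checkpoint. A late write leaves a page on disk that the persisted metadata does not account for.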
Reproducer:
/** {@inheritDoc} */
@Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
    return super.getConfiguration(igniteInstanceName)
        .setDataStorageConfiguration(new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(new DataRegionConfiguration().setPersistenceEnabled(true))
            .setCheckpointFrequency(1_000L))
        .setFailureHandler(new StopNodeFailureHandler());
}

/** */
@Test
public void testCpAfterClusterDeactivate() throws Exception {
    IgniteEx ignite0 = startGrid(0);
    IgniteEx ignite1 = startGrid(1);

    ignite0.cluster().state(ClusterState.ACTIVE);

    ignite0.getOrCreateCache(new CacheConfiguration<>(DEFAULT_CACHE_NAME).setBackups(1)
        .setAffinity(new RendezvousAffinityFunction(false, 10)));

    try (IgniteDataStreamer<Integer, Integer> streamer = ignite0.dataStreamer(DEFAULT_CACHE_NAME)) {
        for (int i = 0; i < 100_000; i++)
            streamer.addData(i, i);
    }

    stopGrid(0);

    try (IgniteDataStreamer<Integer, Integer> streamer = ignite1.dataStreamer(DEFAULT_CACHE_NAME)) {
        streamer.allowOverwrite(true);

        for (int i = 0; i < 100_000; i++)
            streamer.addData(i, i + 1);
    }

    ignite0 = startGrid(0);

    ((GridCacheDatabaseSharedManager)ignite0.context().cache().context().database()).addCheckpointListener(new CheckpointListener() {
        @Override public void onMarkCheckpointBegin(Context ctx) {
            // No-op.
        }

        @Override public void onCheckpointBegin(Context ctx) {
            if ("caches stop".equals(ctx.progress().reason()))
                doSleep(1_000L);
        }

        @Override public void beforeCheckpointBegin(Context ctx) {
            // No-op.
        }
    });

    ignite0.cluster().state(ClusterState.INACTIVE);

    doSleep(2_000L);

    ignite0.cluster().state(ClusterState.ACTIVE);

    IgniteCache<Integer, Integer> cache = ignite0.cache(DEFAULT_CACHE_NAME);

    for (int i = 0; i < 100_000; i++)
        assertEquals((Integer)(i + 1), cache.get(i));
}
The reproducer fails intermittently (about 1 run in 5 on my laptop): the node shuts down with a CorruptedTreeException, either on activation or during the final check.