  1. Ignite
  2. IGNITE-19111

Storage corruption if pages changed after last checkpoint during deactivation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 2.15
    • None
    • Fixed PDS corruption on checkpoint after deactivation
    • Release Notes Required

    Description

      During cluster deactivation, we force a checkpoint (with the reason "caches stop") and remove the checkpoint listeners before the caches are actually stopped. If data pages on the node are modified after that checkpoint but before the caches stop, and the next checkpoint then starts, the storage can be corrupted.
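
      The ordering problem can be illustrated with a toy model. This is only a sketch and not Ignite code: the class `ToyCheckpointRace`, its `Listener` interface and the `pageCount` metadata key are all hypothetical, and the model assumes that checkpoint listeners are what keep on-disk metadata in step with the data pages.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types exist in Ignite. */
final class ToyCheckpointRace {
    /** Hypothetical listener that persists metadata describing the pages. */
    interface Listener {
        void onCheckpoint(Map<Integer, Integer> memPages, Map<String, Integer> diskMeta);
    }

    final Map<Integer, Integer> memPages = new HashMap<>();  // pages in memory
    final Map<Integer, Integer> diskPages = new HashMap<>(); // pages on disk
    final Map<String, Integer> diskMeta = new HashMap<>();   // metadata on disk
    final List<Listener> listeners = new ArrayList<>();

    /** Runs the registered listeners, then writes all pages to "disk". */
    void checkpoint() {
        for (Listener l : listeners)
            l.onCheckpoint(memPages, diskMeta);

        diskPages.putAll(memPages);
    }

    /** The toy store is consistent when the metadata matches the pages on disk. */
    boolean consistent() {
        return diskMeta.getOrDefault("pageCount", 0) == diskPages.size();
    }

    /** Replays the deactivation sequence and reports whether the store stays consistent. */
    static boolean runDeactivation(boolean pageChangedAfterForcedCheckpoint) {
        ToyCheckpointRace node = new ToyCheckpointRace();

        node.listeners.add((pages, meta) -> meta.put("pageCount", pages.size()));
        node.memPages.put(1, 100);

        node.checkpoint();      // Forced checkpoint, listeners still registered.
        node.listeners.clear(); // Listeners removed before the caches stop.

        if (pageChangedAfterForcedCheckpoint)
            node.memPages.put(2, 200); // Page activity in the unprotected window.

        node.checkpoint();      // Next checkpoint runs without listeners.

        return node.consistent();
    }

    public static void main(String[] args) {
        System.out.println("no late change:   consistent=" + runDeactivation(false));
        System.out.println("with late change: consistent=" + runDeactivation(true));
    }
}
```

      In this model the store stays consistent only when no page changes between the listener removal and the next checkpoint. The real failure mode in Ignite may differ in detail.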

      Reproducer:

          /** {@inheritDoc} */
          @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
              return super.getConfiguration(igniteInstanceName)
                  .setDataStorageConfiguration(new DataStorageConfiguration()
                      .setDefaultDataRegionConfiguration(new DataRegionConfiguration().setPersistenceEnabled(true))
                      .setCheckpointFrequency(1_000L))
                  .setFailureHandler(new StopNodeFailureHandler());
          }
      
          /** Checks that data stays intact when a checkpoint runs after cluster deactivation. */
          @Test
          public void testCpAfterClusterDeactivate() throws Exception {
              IgniteEx ignite0 = startGrid(0);
              IgniteEx ignite1 = startGrid(1);

              ignite0.cluster().state(ClusterState.ACTIVE);

              // Cache with one backup and 10 partitions, filled with 100,000 entries.
              ignite0.getOrCreateCache(new CacheConfiguration<>(DEFAULT_CACHE_NAME).setBackups(1)
                  .setAffinity(new RendezvousAffinityFunction(false, 10)));

              try (IgniteDataStreamer<Integer, Integer> streamer = ignite0.dataStreamer(DEFAULT_CACHE_NAME)) {
                  for (int i = 0; i < 100_000; i++)
                      streamer.addData(i, i);
              }

              // Overwrite every entry through node 1 while node 0 is down.
              stopGrid(0);

              try (IgniteDataStreamer<Integer, Integer> streamer = ignite1.dataStreamer(DEFAULT_CACHE_NAME)) {
                  streamer.allowOverwrite(true);
                  for (int i = 0; i < 100_000; i++)
                      streamer.addData(i, i + 1);
              }

              // Restart node 0 and hold up its "caches stop" checkpoint for one second.
              ignite0 = startGrid(0);
              ((GridCacheDatabaseSharedManager)ignite0.context().cache().context().database()).addCheckpointListener(new CheckpointListener() {
                  @Override public void onMarkCheckpointBegin(Context ctx) {
                      // No-op.
                  }

                  @Override public void onCheckpointBegin(Context ctx) {
                      if ("caches stop".equals(ctx.progress().reason()))
                          doSleep(1_000L);
                  }

                  @Override public void beforeCheckpointBegin(Context ctx) {
                      // No-op.
                  }
              });

              // Deactivate, wait two seconds, then reactivate.
              ignite0.cluster().state(ClusterState.INACTIVE);

              doSleep(2_000L);

              ignite0.cluster().state(ClusterState.ACTIVE);

              IgniteCache<Integer, Integer> cache = ignite0.cache(DEFAULT_CACHE_NAME);

              // Every key must hold the overwritten value.
              for (int i = 0; i < 100_000; i++)
                  assertEquals((Integer)(i + 1), cache.get(i));
          }

      This reproducer fails intermittently (in roughly 1 of 5 runs on my laptop): the node shuts down with a CorruptedTreeException, either on activation or during the final check.
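
      Because the failure is intermittent, a single passing run proves little. As a rough guide, the sketch below estimates how many runs are needed to see at least one failure with a given confidence. It assumes that runs are independent and that each fails with the same probability; the class and method names are hypothetical and not part of the Ignite test framework.

```java
/** Rough estimate only: assumes independent runs with a fixed failure probability. */
final class ReproRuns {
    /** Probability that at least one of {@code runs} runs fails. */
    static double chanceOfFailure(double failProb, int runs) {
        return 1.0 - Math.pow(1.0 - failProb, runs);
    }

    /** Smallest number of runs for which chanceOfFailure reaches {@code confidence}. */
    static int runsNeeded(double failProb, double confidence) {
        if (failProb <= 0.0 || failProb >= 1.0 || confidence <= 0.0 || confidence >= 1.0)
            throw new IllegalArgumentException("Both arguments must lie strictly between 0 and 1");

        // Solve 1 - (1 - p)^n >= c for n, which gives n >= ln(1 - c) / ln(1 - p).
        return (int)Math.ceil(Math.log(1.0 - confidence) / Math.log(1.0 - failProb));
    }

    public static void main(String[] args) {
        // Failure rate of about 1/5, as quoted for the reproducer.
        System.out.println(runsNeeded(0.2, 0.95));
    }
}
```

      With a failure rate of 1/5, this gives 14 runs for a 95% chance of hitting the failure at least once. The same count of consecutive passing runs is a reasonable minimum before trusting a fix.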

            People

              Assignee: alex_pl Aleksey Plekhanov
              Reporter: alex_pl Aleksey Plekhanov
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m