Ignite / IGNITE-8166

stopGrid() hangs in some cases when node is invalidated and PDS is enabled

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.5
    • Fix Version/s: 2.5
    • Component/s: None
    • Labels:
      None

      Description

      Node invalidation via FailureProcessor can hang exchange-worker and stopGrid() when PDS is enabled.

      Reproducer (racy; it sometimes finishes without hanging):

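      import java.util.HashMap;
      import java.util.Map;
      import org.apache.ignite.Ignite;
      import org.apache.ignite.IgniteCache;
      import org.apache.ignite.configuration.DataRegionConfiguration;
      import org.apache.ignite.configuration.DataStorageConfiguration;
      import org.apache.ignite.configuration.IgniteConfiguration;
      import org.apache.ignite.failure.FailureContext;
      import org.apache.ignite.failure.FailureHandler;
      import org.apache.ignite.failure.FailureType;
      import org.apache.ignite.internal.IgniteEx;
      import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;

      /** Reproduces the stopGrid() hang after node invalidation with persistence enabled. */
      public class StopNodeHangsTest extends GridCommonAbstractTest {
          /** Off-heap size for the data region. */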
          private static final int SIZE = 10 * 1024 * 1024;
      
          /** Page size. */
          static final int PAGE_SIZE = 2048;
      
          /** Number of entries. */
          static final int ENTRIES = 2_000;
      
          /** {@inheritDoc} */
          @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
              IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
      
              DataStorageConfiguration dsCfg = new DataStorageConfiguration();
      
              DataRegionConfiguration dfltPlcCfg = new DataRegionConfiguration();
      
              dfltPlcCfg.setName("dfltPlc");
              dfltPlcCfg.setInitialSize(SIZE);
              dfltPlcCfg.setMaxSize(SIZE);
              dfltPlcCfg.setPersistenceEnabled(true);
      
              dsCfg.setDefaultDataRegionConfiguration(dfltPlcCfg);
              dsCfg.setPageSize(PAGE_SIZE);
      
              cfg.setDataStorageConfiguration(dsCfg);
      
              cfg.setFailureHandler(new FailureHandler() {
                  @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
                      return true;
                  }
              });
      
              return cfg;
          }
      
          public void testStopNodeHangs() throws Exception {
              cleanPersistenceDir();
      
              IgniteEx ignite0 = startGrid(0);
              IgniteEx ignite1 = startGrid(1);
      
              ignite1.cluster().active(true);
      
              awaitPartitionMapExchange();
      
              IgniteCache<Integer, Object> cache = ignite1.getOrCreateCache("TEST");
      
              Map<Integer, Object> entries = new HashMap<>();
      
              for (int i = 0; i < ENTRIES; i++)
                  entries.put(i, new byte[PAGE_SIZE * 2 / 3]);
      
              cache.putAll(entries);
      
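              // Invalidate the node by simulating a critical failure.
              ignite1.context().failure().process(new FailureContext(FailureType.CRITICAL_ERROR, null));

              stopGrid(0);
              stopGrid(1); // May hang here while waiting for the exchange to finish.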
          }
      }
      

      stopGrid(1) waits until the exchange finishes. The exchange-worker is blocked in GridCacheDatabaseSharedManager#checkpointReadLock, waiting for CheckpointProgressSnapshot#cpBeginFut, but this future is never completed: the db-checkpoint-thread gets an exception in GridCacheDatabaseSharedManager.Checkpointer#markCheckpointBegin (thrown by FileWriteAheadLogManager#checkNode) and leaves markCheckpointBegin before the future is completed (curr.cpBeginFut.onDone();).
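
      The failure mode described above can be illustrated outside Ignite with a minimal sketch. It uses the JDK's CompletableFuture as a stand-in for cpBeginFut; the class CheckpointFutureSketch and the methods markBeginLeaky, markBeginSafe, and waiterHangs are hypothetical names introduced only for this illustration. The "safe" variant shows one possible way to avoid the hang (completing the future on the failure path) and is an assumption, not necessarily the fix applied in Ignite.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CheckpointFutureSketch {
    /** Hypothetical stand-in for the WAL node check; always fails, as on an invalidated node. */
    static void checkNode() {
        throw new IllegalStateException("Node is invalidated");
    }

    /** Mirrors the reported flow: the exception escapes before the future is completed. */
    static void markBeginLeaky(CompletableFuture<Void> cpBeginFut) {
        checkNode();

        cpBeginFut.complete(null); // Never reached when checkNode() throws.
    }

    /** Assumed remedy: complete the future exceptionally before rethrowing. */
    static void markBeginSafe(CompletableFuture<Void> cpBeginFut) {
        try {
            checkNode();

            cpBeginFut.complete(null);
        }
        catch (RuntimeException e) {
            cpBeginFut.completeExceptionally(e);

            throw e;
        }
    }

    /** Returns true if a waiter on the future is still blocked after a short timeout. */
    static boolean waiterHangs(CompletableFuture<Void> fut) throws InterruptedException {
        try {
            fut.get(200, TimeUnit.MILLISECONDS);

            return false; // Completed normally.
        }
        catch (TimeoutException e) {
            return true; // Nobody completed the future: the waiter would block forever.
        }
        catch (ExecutionException e) {
            return false; // Completed with an error: the waiter is released.
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> leaky = new CompletableFuture<>();

        try {
            markBeginLeaky(leaky);
        }
        catch (IllegalStateException ignored) {
            // The checkpoint thread fails here, as in the report.
        }

        System.out.println("leaky waiter hangs: " + waiterHangs(leaky));

        CompletableFuture<Void> safe = new CompletableFuture<>();

        try {
            markBeginSafe(safe);
        }
        catch (IllegalStateException ignored) {
            // Same failure, but the future has already been completed exceptionally.
        }

        System.out.println("safe waiter hangs: " + waiterHangs(safe));
    }
}
```

      In the leaky variant the waiter times out because nothing ever completes the future, which corresponds to the exchange-worker blocking in checkpointReadLock. In the safe variant the waiter is released with an error and can proceed to shut down.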

            People

            • Assignee:
              alex_pl Aleksey Plekhanov
              Reporter:
              alex_pl Aleksey Plekhanov
            • Votes:
              0
              Watchers:
              5