Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-9562

Destroyed cache that resurrected on an old offline node breaks PME

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 2.5
    • 2.7.6
    • cache
    • Fixed an issue where an outdated node with a destroyed cache caused the cluster to hang
    • Docs Required, Release Notes Required

    Description

      Given:
      2 nodes, persistence enabled.
      1) Stop 1 node
      2) Destroy cache through client
      3) Start stopped node

      When the stopped node joins to cluster it starts all caches that it has seen before stopping.
      If that cache was cluster-widely destroyed it leads to breaking the crash recovery process or PME.

      Root cause - we don't start/collect caches from the stopped node on another part of a cluster.

      In case of PARTITIONED cache mode that scenario breaks crash recovery:

      java.lang.AssertionError: AffinityTopologyVersion [topVer=-1, minorTopVer=0]
      
      	at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:696)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.updateLocal(GridDhtPartitionTopologyImpl.java:2449)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.afterStateRestored(GridDhtPartitionTopologyImpl.java:679)
      	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2445)
      	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLastUpdates(GridCacheDatabaseSharedManager.java:2321)
      	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreState(GridCacheDatabaseSharedManager.java:1568)
      	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1308)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1255)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:766)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2577)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2457)
      	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
      	at java.lang.Thread.run(Thread.java:748)
      

      In case of REPLICATED cache mode that scenario breaks PME coordinator process:

      [2018-09-12 18:50:36,407][ERROR][sys-#148%distributed.CacheStopAndRessurectOnOldNodeTest0%][GridCacheIoManager] Failed to process message [senderId=4b6fd0d4-b756-4a9f-90ca-f0ee25100001, messageType=class o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsSingleMessage]
      java.lang.AssertionError: 3080586
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.clientTopology(GridCachePartitionExchangeManager.java:815)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updatePartitionSingleMap(GridDhtPartitionsExchangeFuture.java:3621)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2439)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:137)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2261)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2249)
      	at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:383)
      	at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:353)
      	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2249)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1628)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.access$1100(GridCachePartitionExchangeManager.java:141)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$2.onMessage(GridCachePartitionExchangeManager.java:368)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$2.onMessage(GridCachePartitionExchangeManager.java:332)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2999)
      	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2978)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
      	at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      As one of the solutions - we shouldn't start such caches on resurrected nodes.
      We should save caches changes history somewhere and cluster-widely spread it to joining nodes.
      In a case when cache was only stopped, we can do nothing and start it lately when cache start request received.
      In a case when cache was stopped & destroyed, we should clean persistence data for that cache.

      Attachments

        Issue Links

          Activity

            People

              EdShangGG Eduard Shangareev
              jokser Pavel Kovalenko
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 40m
                  2h 40m