IGNITE-6256

When a node becomes segmented an AssertionError is thrown during GridDhtPartitionTopologyImpl.removeNode


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8
    • Fix Version/s: 2.3
    • Component/s: general
    • Labels: None

    Description

      The assert is as follows:

      exception="java.lang.AssertionError: null
      at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.removeNode(GridDhtPartitionTopologyImpl.java:1422)
      at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:490)
      at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:769)
      at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:504)
      at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1689)
      at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
      at java.lang.Thread.run(Thread.java:745)

      Below is the sequence of steps that leads to the assertion error:

      1) A node is marked SEGMENTED by SegmentCheckWorker after an EVT_NODE_FAILED has been received
      2) It gets visibleRemoteNodes from its TcpDiscoveryNodesRing
      3) It clears the TcpDiscoveryNodesRing, leaving only itself in the list. The node ring is used to determine whether
      a node is alive during DiscoCache creation
      4) After that, the node initiates removal of all the nodes read in step 2
      5) For each node, it sends an EVT_NODE_FAILED to the corresponding DiscoverySpiListener,
      providing a topology that contains all the nodes except those already processed
      6) This event gets into GridDiscoveryManager
      7) The node gets removed from alive nodes for every DiscoCache in discoCacheHist
      8) Topology change is detected
      9) Creation of a new DiscoCache is attempted. At this moment no remote node is considered alive because the
      TcpDiscoveryNodesRing has been cleared, which results in a DiscoCache with empty alives
      10) The event with the created DiscoCache and the new topology version is passed to DiscoveryWorker
      11) The event is eventually handled by DiscoveryWorker and is recorded by DiscoveryWorker#recordEvent
      12) The recording is handled by GridEventStorageManager which notifies every listener for this event type (EVT_NODE_FAILED)
      13) One of the listeners is GridCachePartitionExchangeManager#discoLsnr
      It creates a new GridDhtPartitionsExchangeFuture with the empty DiscoCache received with the event and enqueues it
      14) The future is eventually picked up by the ExchangeWorker and initialized (GridDhtPartitionsExchangeFuture#init)
      15) updateTopologies is called, which for each GridCacheContext gets its topology (GridDhtPartitionTopology)
      and calls GridDhtPartitionTopology#updateTopologyVersion
      16) The DiscoCache of the GridDhtPartitionTopology is assigned from that of the GridDhtPartitionsExchangeFuture.
      The assigned DiscoCache has empty alives at the moment
      17) A distributed exchange is handled (GridDhtPartitionsExchangeFuture#distributedExchange)
      18) For each GridCacheContext, GridDhtPartitionTopologyImpl#beforeExchange is called on its topology
      19) The fact that the node has left is determined and GridDhtPartitionTopologyImpl#removeNode is called to handle it
      20) An attempt is made to get the alive coordinator node by calling DiscoCache#oldestAliveServerNode
      21) null is returned, which results in an AssertionError
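
      The core of the sequence above can be modelled in a few lines of plain Java. This is only a simplified sketch and
      does not use Ignite's classes: Node, DiscoSnapshot, snapshot and coordinatorAfterSegmentation are hypothetical
      stand-ins, and the model assumes that only remote server nodes are tracked, since the report states that the
      resulting alive set is empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of the failure; these are not Ignite's classes. */
class SegmentationSketch {
    /** Stand-in for a cluster node: an id plus its join order. */
    static final class Node {
        final String id;
        final long order;

        Node(String id, long order) {
            this.id = id;
            this.order = order;
        }
    }

    /** Stand-in for a DiscoCache: a snapshot of the nodes considered alive. */
    static final class DiscoSnapshot {
        final List<Node> alive;

        DiscoSnapshot(List<Node> alive) {
            this.alive = alive;
        }

        /** Returns the alive node with the lowest order, or null if none is alive. */
        Node oldestAliveServerNode() {
            Node oldest = null;
            for (Node n : alive)
                if (oldest == null || n.order < oldest.order)
                    oldest = n;
            return oldest;
        }
    }

    /** Builds a snapshot, treating a node as alive only while it is still in the ring. */
    static DiscoSnapshot snapshot(List<Node> topology, List<Node> ring) {
        List<Node> alive = new ArrayList<>();
        for (Node n : topology)
            if (ring.contains(n))
                alive.add(n);
        return new DiscoSnapshot(alive);
    }

    /** Replays the segmentation sequence and returns the coordinator that removeNode would see. */
    static Node coordinatorAfterSegmentation() {
        Node local = new Node("local", 3);
        List<Node> remotes = Arrays.asList(new Node("a", 1), new Node("b", 2));
        List<Node> ring = new ArrayList<>(remotes);
        ring.add(local);

        // Step 2: remember the remote nodes before touching the ring.
        List<Node> visibleRemoteNodes = new ArrayList<>(remotes);

        // Step 3: clear the ring so that only the local node remains.
        ring.clear();
        ring.add(local);

        // Step 9: the snapshot is built after the ring was cleared, so no remote node is alive.
        DiscoSnapshot cache = snapshot(visibleRemoteNodes, ring);

        // Steps 20-21: no alive coordinator can be found.
        return cache.oldestAliveServerNode();
    }

    public static void main(String[] args) {
        System.out.println("coordinator after segmentation: " + coordinatorAfterSegmentation());
    }
}
```

      In this model the lookup returns null for the same reason as in the report: aliveness is checked against a ring
      that was already emptied before the snapshot was built.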

      The fix should probably prevent initiating exchange futures if a node has segmented.
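
      One possible shape of such a guard is sketched below. This is an illustration of the idea only, not the change
      that was committed for this issue: ExchangeGuardSketch, onSegmented, onNodeFailed and pendingExchanges are
      hypothetical names, and the exchange future is reduced to a string placeholder.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical guard: stop enqueuing exchange work once the local node is segmented. */
class ExchangeGuardSketch {
    /** Placeholder for the queue of pending exchange futures. */
    private final Queue<String> exchangeQueue = new ArrayDeque<>();

    /** Set once the local node has been marked as segmented. */
    private volatile boolean segmented;

    /** Called when the local node is detected as segmented. */
    void onSegmented() {
        segmented = true;
    }

    /**
     * Called for every node-failed event.
     * Returns true if an exchange was enqueued, false if it was skipped.
     */
    boolean onNodeFailed(String nodeId) {
        // A segmented node has no reliable view of the topology, so skip the exchange.
        if (segmented)
            return false;

        exchangeQueue.add("exchange-for-" + nodeId);
        return true;
    }

    /** Number of exchanges waiting to be processed. */
    int pendingExchanges() {
        return exchangeQueue.size();
    }

    public static void main(String[] args) {
        ExchangeGuardSketch guard = new ExchangeGuardSketch();
        guard.onNodeFailed("a"); // enqueued: the node is not segmented yet
        guard.onSegmented();
        guard.onNodeFailed("b"); // skipped: the node is segmented
        System.out.println("pending exchanges: " + guard.pendingExchanges());
    }
}
```

      With this guard, node-failed events produced while the ring is being cleared no longer create exchange futures
      that carry a DiscoCache with empty alives.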

          People

            amashenkov Andrey Mashenkov
            asfedotov Alexandr Fedotov