Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-16589

Failure handler kills server node on getting affinity from old topology

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.13
    • None
    • Fixed an issue that led to failures of server nodes due to short history of affinity assignments.
    • Release Notes Required

    Description

      In general, the following exception seems to be a bit overkill

      [2022-02-21 10:34:53,347][ERROR][aff-#300%cache.CacheNoAffinityExchangeTest0%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=java.lang.IllegalStateException: Getting affinity for too old topology version that is already out of history [locNode=TcpDiscoveryNode [id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, loc=true, ver=2.13.0#20220218-sha1:7e63c212, isClient=false], grp=client-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], head=AffinityTopologyVersion [topVer=8, minorTopVer=0], history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], AffinityTopologyVersion [topVer=8, minorTopVer=0]]]]]
      java.lang.IllegalStateException: Getting affinity for too old topology version that is already out of history [locNode=TcpDiscoveryNode [id=0917cb9d-2825-46eb-b210-1e2846f00000, consistentId=127.0.0.1:47500, addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1645428893228, loc=true, ver=2.13.0#20220218-sha1:7e63c212, isClient=false], grp=client-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], lastAffChangeTopVer=AffinityTopologyVersion [topVer=1, minorTopVer=2], head=AffinityTopologyVersion [topVer=8, minorTopVer=0], history=[AffinityTopologyVersion [topVer=7, minorTopVer=0], AffinityTopologyVersion [topVer=8, minorTopVer=0]]]
      	at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:849)
      	at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:796)
      	at org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest0(CacheGroupContext.java:1130)
      	at org.apache.ignite.internal.processors.cache.CacheGroupContext.processAffinityAssignmentRequest(CacheGroupContext.java:1116)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110)
      	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528)
      	at org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
      	at org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421)
      	at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      it seems that much more suitable way to handle this should be to kill/restart the client node or return an error for the client to handle or to attempt remapping the operation on the new topology. For now, it is not clear why exactly the node must be shut down after this exception and documentation doesn't explain how to avoid this.

      The root cause of the issue is that AffinityRequest from the client node refers to the "old" topology version, which has been already wiped on the server-side.
      The possible scenario is the following:

      • the client wants to get a proxy for a cache that is already started on server nodes
      • client started custom exchange task (see GridCacheProcessor.processCustomExchangeTask)
      • before sending AffinityRequest the client node hanged due to a long GC pause, for example
      • cluster topology changed multiple times during that GC pause (the number of changes is enough to clean up the old history of affinity assignments)
      • server node received AffinityRequest from the client and, unfortunately, cannot process it in the right way because of a lack of history.

      IMHO we can respond to the client with an "empty" AffinityResponse which should provide a cause of the problem.
      In that case, the client node may try to reconnect to the cluster (if PME is in progress), or retry an operation.

      Attachments

        Issue Links

          Activity

            People

              slava.koptilin Vyacheslav Koptilin
              slava.koptilin Vyacheslav Koptilin
              Vladislav Pyatkov Vladislav Pyatkov
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m