We have a cluster with server and client nodes.
We dynamically start several caches on a cluster.
Periodically we create and destroy some temporary cache in a cluster to move up cluster topology version.
At the same time, a random client node chooses a random existing cache and performs operations on that cache.
It leads to an exception on client node that affinity is not initialized for a cache during cache operation like:
Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]
This exception means that the last affinity for a cache is calculated on version [8,2]. This is a cache start version. It happens because during creating/destroying some temporary cache we don’t re-calculate affinity for all existing but not already accessed caches on client nodes. Re-calculate in this case is cheap - we just copy affinity assignment and increment topology version.
As a solution, we need to fetch affinity on client node join for all caches. Also, we need to re-calculate affinity for all affinity holders (not only for started caches or only configured caches) for all topology events that happened in a cluster on a client node.
This solution showed the existing race between client node join and concurrent cache destroy.
The race is the following:
Client node (with some configured caches) joins to a cluster sending SingleMessage to coordinator during client PME. This SingleMessage contains affinity fetch requests for all cluster caches. When SingleMessage is in-flight server nodes finish client PME and also process and finish cache destroy PME. When a cache is destroyed affinity for that cache is cleared. When SingleMessage delivered to coordinator it doesn’t have affinity for a requested cache because the cache is already destroyed. It leads to assertion error on the coordinator and unpredictable behavior on the client node.
The race may be fixed with the following change:
If the coordinator doesn’t have an affinity for requested cache from the client node, it doesn’t break PME with assertion error, just doesn’t send affinity for that cache to a client node. When the client node receives FullMessage and sees that affinity for some requested cache doesn’t exist, it just closes cache proxy for user interactions which throws CacheStopped exception for every attempt to use that cache. This is safe behavior because cache destroy event should be happened on the client node soon and destroy that cache completely.