Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-12255

Cache affinity fetching and calculation on client nodes may be broken in some cases

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.5, 2.7
    • Fix Version/s: 2.8
    • Component/s: cache
    • Labels:
      None
    • Release Note:
      Fixed affinity calculation and fetching on client nodes
    • Ignite Flags:
      Release Notes Required

      Description

      We have a cluster with server and client nodes.
      We dynamically start several caches on a cluster.
      Periodically we create and destroy some temporary cache in a cluster to move up cluster topology version.
      At the same time, a random client node chooses a random existing cache and performs operations on that cache.
      It leads to an exception on client node that affinity is not initialized for a cache during cache operation like:
      Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]

      This exception means that the last affinity for a cache is calculated on version [8,2]. This is a cache start version. It happens because during creating/destroying some temporary cache we don’t re-calculate affinity for all existing but not already accessed caches on client nodes. Re-calculate in this case is cheap - we just copy affinity assignment and increment topology version.

      As a solution, we need to fetch affinity on client node join for all caches. Also, we need to re-calculate affinity for all affinity holders (not only for started caches or only configured caches) for all topology events that happened in a cluster on a client node.

      This solution showed the existing race between client node join and concurrent cache destroy.

      The race is the following:

      Client node (with some configured caches) joins to a cluster sending SingleMessage to coordinator during client PME. This SingleMessage contains affinity fetch requests for all cluster caches. When SingleMessage is in-flight server nodes finish client PME and also process and finish cache destroy PME. When a cache is destroyed affinity for that cache is cleared. When SingleMessage delivered to coordinator it doesn’t have affinity for a requested cache because the cache is already destroyed. It leads to assertion error on the coordinator and unpredictable behavior on the client node.

      The race may be fixed with the following change:

      If the coordinator doesn’t have an affinity for requested cache from the client node, it doesn’t break PME with assertion error, just doesn’t send affinity for that cache to a client node. When the client node receives FullMessage and sees that affinity for some requested cache doesn’t exist, it just closes cache proxy for user interactions which throws CacheStopped exception for every attempt to use that cache. This is safe behavior because cache destroy event should be happened on the client node soon and destroy that cache completely.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                jokser Pavel Kovalenko
                Reporter:
                jokser Pavel Kovalenko
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m