  1. Ignite
  2. IGNITE-19410

Node failure when multiple nodes join and leave a cluster simultaneously with security enabled.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 2.16
    • security
    • Release Note: Fixed node crash due to SecurityContext not being found during discovery message processing.
    • Release Notes Required

    Description

      When nodes with security enabled join and leave the cluster simultaneously, the joining nodes can fail with the following exception:

      [2023-05-03T14:54:31,208][ERROR][disco-notifier-worker-#332%ignite.NodeSecurityContextTest2%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Failed to find security context for subject with given ID : 4725544a-f144-4486-a705-46b2ac200011]]
       java.lang.IllegalStateException: Failed to find security context for subject with given ID : 4725544a-f144-4486-a705-46b2ac200011
          at org.apache.ignite.internal.processors.security.IgniteSecurityProcessor.withContext(IgniteSecurityProcessor.java:164) ~[classes/:?]
          at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$3$SecurityAwareNotificationTask.run(GridDiscoveryManager.java:949) ~[classes/:?]
          at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryMessageNotifierWorker.body0(GridDiscoveryManager.java:2822) ~[classes/:?]
          at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryMessageNotifierWorker.body(GridDiscoveryManager.java:2860) [classes/:?]
          at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) [classes/:?]
          at java.lang.Thread.run(Thread.java:750) [?:1.8.0_351] 

      Reproducer is attached.

      Simplified steps that lead to the failure:

      1. The client node sends an arbitrary discovery message that produces an acknowledgement message once it has been processed by all cluster nodes.
      2. The client node gracefully leaves the cluster.
      3. A new node joins the cluster and receives a topology snapshot that does not include the client node that left.
      4. The new node receives the acknowledgement for the message from step 1 and fails while processing it, because the message originator node is not listed in the current discovery cache or in the discovery cache history (see IgniteSecurityProcessor#withContext(java.util.UUID)). This happens because the GridDiscoveryManager#historicalNode method is currently only aware of the topology history that accumulates after a node has joined the cluster. The complete cluster topology history that exists at the time the new node joins is stored in GridDiscoveryManager#topHist and is not taken into account by the GridDiscoveryManager#historicalNode method.
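
      The lookup gap described in step 4 can be illustrated with a small standalone model. This is only a sketch under stated assumptions: NodeView, findAfterJoinOnly and findIncludingPreJoin are hypothetical names, not Ignite classes or methods, and the fallback to pre-join history is one possible way to close the gap, not necessarily what the actual fix does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

/** Toy model of the lookup gap; none of these types are Ignite classes. */
public class HistoricalNodeLookupSketch {
    /** What a freshly joined node knows about other nodes (hypothetical structure). */
    public static final class NodeView {
        /** Nodes in the current topology (stands in for the current discovery cache). */
        private final Map<UUID, String> current = new HashMap<>();

        /** Nodes seen in topology versions observed after this node joined. */
        private final Map<UUID, String> afterJoin = new HashMap<>();

        /** Nodes from the cluster topology history handed over at join time. */
        private final Map<UUID, String> beforeJoin = new HashMap<>();

        public void addCurrent(UUID id, String name) { current.put(id, name); }

        public void addAfterJoin(UUID id, String name) { afterJoin.put(id, name); }

        public void addBeforeJoin(UUID id, String name) { beforeJoin.put(id, name); }

        /** Lookup limited to the current topology and post-join history (step 4 behaviour). */
        public Optional<String> findAfterJoinOnly(UUID id) {
            String name = current.get(id);

            if (name == null)
                name = afterJoin.get(id);

            return Optional.ofNullable(name);
        }

        /** Same lookup, additionally falling back to the pre-join history. */
        public Optional<String> findIncludingPreJoin(UUID id) {
            Optional<String> found = findAfterJoinOnly(id);

            return found.isPresent() ? found : Optional.ofNullable(beforeJoin.get(id));
        }
    }

    public static void main(String[] args) {
        UUID client = UUID.randomUUID();
        UUID server = UUID.randomUUID();

        NodeView newNode = new NodeView();

        // Step 3: the topology snapshot received on join does not contain the client.
        newNode.addCurrent(server, "server");

        // The client is only present in the history that predates the join.
        newNode.addBeforeJoin(client, "client");

        // Step 4: the acknowledgement originated by the client arrives.
        System.out.println("post-join lookup finds originator: "
            + newNode.findAfterJoinOnly(client).isPresent());
        System.out.println("lookup with pre-join history finds originator: "
            + newNode.findIncludingPreJoin(client).isPresent());
    }
}
```

      In this model the first lookup comes back empty, which corresponds to the point where IgniteSecurityProcessor#withContext throws the IllegalStateException shown above, while the second lookup still resolves the originator.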

       

      Attachments

        1. NodeSecurityContextTest.java
          18 kB
          Mikhail Petrov

            People

              Assignee: Mikhail Petrov
              Reporter: Mikhail Petrov


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 3h