ActiveMQ Artemis / ARTEMIS-3345

Shared-Nothing Replication Master loses Node ID on failed fail-back

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.17.0
    • Fix Version/s: None
    • Component/s: Broker
    • Labels: None

    Description

      A failing-back master forgets its Node ID and, on broker restart with a different Node ID, can become live without searching for an existing live broker that has its previous Node ID.

      This happens because of the following mechanics in SharedNothingBackupActivation:

      1. SharedNothingBackupActivation::init calls activeMQServer.resetNodeManager, which re-creates a NodeManager with replicatingBackup == true
      2. SharedNothingBackupActivation::run then executes:
                 // move all data away:
                 activeMQServer.getNodeManager().stop();
                 activeMQServer.moveServerData(replicaPolicy.getMaxSavedReplicatedJournalsSize());
                 activeMQServer.getNodeManager().start();
        

        The server data rotation cleans up everything on the data path, including the lock file.
        NodeManager::start, because replicatingBackup == true, skips setting up a new lock file (there is no lock file at this point and, as a consequence, no durable Node ID); see:

           @Override
           public synchronized void start() throws Exception {
              if (isStarted()) {
                 return;
              }
              if (!replicatedBackup) {
                 setUpServerLockFile();
              }
        
              super.start();
           }
        
      3. The broker sets an in-memory Node ID after a successful sync with the live, using NodeManager::setNodeID
      4. If the broker is going to fail over (or fail back, given that it is a master), activeMQServer.getNodeManager().stopBackup() sets up a new lock file with the previously set Node ID; see:
           @Override
           public void stopBackup() throws NodeManagerException {
              if (replicatedBackup && getNodeId() != null) {
                 try {
                    setUpServerLockFile();
                 } catch (IOException e) {
                    throw new NodeManagerException(e);
                 }
              }
              super.stopBackup();
           }
        

      This sequence shows that if anything goes wrong before the Node ID is written to durable storage, either because the broker was unable to become live (no majority, or the current live is still alive) or because of a restart with unlucky timing, the broker is left without any lock file and forgets its original Node ID.
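
      The four steps above can be reproduced with a small toy model. This is only a sketch: DataDir, ToyNodeManager and failBackThenRestart are hypothetical names, not Artemis classes, and the restart is assumed to come up with a manager that is not a replicated backup and that generates a fresh Node ID when no lock file exists.

```java
import java.util.UUID;

/** Toy model of the lock-file lifecycle described above; these are NOT the Artemis classes. */
public class NodeIdLossSketch {

   /** Stands in for the data directory: the lock file either holds a Node ID or is absent. */
   static final class DataDir {
      String lockFileNodeId; // null means "no lock file on disk"

      // Data rotation removes everything on the data path, lock file included.
      void moveServerData() {
         lockFileNodeId = null;
      }
   }

   /** Hypothetical stand-in for a file-based NodeManager. */
   static final class ToyNodeManager {
      private final DataDir dir;
      private final boolean replicatedBackup;
      private String nodeId; // held in memory only until a lock file is written

      ToyNodeManager(DataDir dir, boolean replicatedBackup) {
         this.dir = dir;
         this.replicatedBackup = replicatedBackup;
      }

      void start() {
         if (!replicatedBackup) {
            setUpServerLockFile();
         }
      }

      void setNodeID(String id) {
         nodeId = id;
      }

      void stopBackup() {
         if (replicatedBackup && nodeId != null) {
            setUpServerLockFile();
         }
      }

      String getNodeId() {
         return nodeId;
      }

      // Assumption: reuse the ID found on disk, otherwise persist the current
      // in-memory ID, generating a random one if none was ever set.
      private void setUpServerLockFile() {
         if (dir.lockFileNodeId != null) {
            nodeId = dir.lockFileNodeId;
            return;
         }
         if (nodeId == null) {
            nodeId = UUID.randomUUID().toString();
         }
         dir.lockFileNodeId = nodeId;
      }
   }

   /** Runs steps 1 to 4, then a restart, and returns the Node ID seen after the restart. */
   public static String failBackThenRestart(String originalNodeId, boolean becameLive) {
      DataDir dir = new DataDir();
      dir.lockFileNodeId = originalNodeId;

      ToyNodeManager backup = new ToyNodeManager(dir, true); // step 1: manager re-created as backup
      dir.moveServerData();                                  // step 2: rotation drops the lock file
      backup.start();                                        //         start() skips the lock file
      backup.setNodeID(originalNodeId);                      // step 3: Node ID in memory only
      if (becameLive) {
         backup.stopBackup();                                // step 4: Node ID finally made durable
      }

      // Restart (assumed to come up as a non-backup manager on the same data directory).
      ToyNodeManager restarted = new ToyNodeManager(dir, false);
      restarted.start();
      return restarted.getNodeId();
   }

   public static void main(String[] args) {
      String original = "node-A";
      System.out.println(failBackThenRestart(original, true).equals(original));  // true
      System.out.println(failBackThenRestart(original, false).equals(original)); // false
   }
}
```

      In this model the original Node ID survives only when stopBackup() is reached; if the restart happens before that point, the directory has no lock file and the restarted manager ends up with a different Node ID.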

            People

              Assignee: Francesco Nigro (nigrofranz)
              Reporter: Francesco Nigro (nigrofranz)