  1. Apache Ozone
  2. HDDS-3698 Ozone Non-Rolling upgrades
  3. HDDS-5170

Race condition in NodeStateManager#addNode allows datanodes with lower MLV to be used in pipelines


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed

    Description

      HDDS-4946 introduced a race condition in NodeStateManager#addNode. SCM's background pipeline creator, or any other thread, can read a datanode whose metadata layout version (MLV) is lower than SCM's as HEALTHY before that node is moved to the HEALTHY_READONLY state.

      
        public void addNode(DatanodeDetails datanodeDetails,
            LayoutVersionProto layoutInfo) throws NodeAlreadyExistsException {
          NodeStatus newNodeStatus = newNodeStatus(datanodeDetails);
          nodeStateMap.addNode(datanodeDetails, newNodeStatus, layoutInfo);
          UUID dnID = datanodeDetails.getUuid();
          try {
            updateLastKnownLayoutVersion(datanodeDetails, layoutInfo);
            DatanodeInfo dnInfo = nodeStateMap.getNodeInfo(dnID);
            NodeStatus status = nodeStateMap.getNodeStatus(dnID);
      
            // State machine starts nodes as HEALTHY. If there is a layout
            // mismatch, this node should be moved to HEALTHY_READONLY.
            updateNodeLayoutVersionState(dnInfo, layoutMisMatchCondition, status,
                NodeLifeCycleEvent.LAYOUT_MISMATCH);
          } catch (NodeNotFoundException ex) {
            LOG.error("Inconsistent NodeStateMap! Datanode with ID {} was " +
                "added but not found in  map: {}", dnID, nodeStateMap);
          }
          eventPublisher.fireEvent(SCMEvents.NEW_NODE, datanodeDetails);
        }
      
      

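      The node is added to the node state map, where other threads can see it, in the HEALTHY state. Its layout version information is updated and checked only afterwards, so there is a window in which a node with a lower MLV is visible as HEALTHY.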

      This manifests as an intermittent failure in TestSCMNodeManager#testSCMLayoutOnRegister, which fails because of this race after about 15 to 30 consecutive runs.
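
One way to close the window described above is to decide the node's initial state from its reported layout version before the node becomes visible, and then publish it in a single atomic step. The sketch below illustrates that idea only. It is not the Ozone implementation or the fix that was committed: NodeRegistrySketch, its methods, and the "lower MLV means read-only" rule are hypothetical simplifications, and a plain ConcurrentHashMap stands in for the node state map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified stand-in for SCM's node tracking; not Ozone code. */
final class NodeRegistrySketch {

  enum NodeState { HEALTHY, HEALTHY_READONLY }

  private final int scmLayoutVersion;
  private final Map<UUID, NodeState> nodes = new ConcurrentHashMap<>();

  NodeRegistrySketch(int scmLayoutVersion) {
    this.scmLayoutVersion = scmLayoutVersion;
  }

  /** Assumed rule: a datanode with a lower MLV than SCM starts read-only. */
  NodeState initialState(int datanodeLayoutVersion) {
    return datanodeLayoutVersion < scmLayoutVersion
        ? NodeState.HEALTHY_READONLY
        : NodeState.HEALTHY;
  }

  /**
   * Computes the state first, then publishes the node with one atomic
   * putIfAbsent, so readers never see a HEALTHY entry for a lower-MLV node.
   */
  void addNode(UUID id, int datanodeLayoutVersion) {
    NodeState state = initialState(datanodeLayoutVersion);
    if (nodes.putIfAbsent(id, state) != null) {
      throw new IllegalStateException("Node already exists: " + id);
    }
  }

  NodeState getState(UUID id) {
    return nodes.get(id);
  }

  /** What a pipeline-creating thread might call to pick usable nodes. */
  List<UUID> healthyNodes() {
    List<UUID> result = new ArrayList<>();
    for (Map.Entry<UUID, NodeState> entry : nodes.entrySet()) {
      if (entry.getValue() == NodeState.HEALTHY) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    NodeRegistrySketch registry = new NodeRegistrySketch(2);
    UUID oldNode = UUID.randomUUID();
    UUID currentNode = UUID.randomUUID();
    registry.addNode(oldNode, 1);      // lower MLV than SCM
    registry.addNode(currentNode, 2);  // same MLV as SCM
    System.out.println(registry.getState(oldNode));
    System.out.println(registry.getState(currentNode));
  }
}
```

Because the state is computed before the map insert, a concurrent call to healthyNodes() either does not see the new node at all or sees it already in its final initial state. The real NodeStateManager tracks more per-node data than this sketch, so the same ordering would have to be applied to whatever it publishes to other threads.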

            People

              Assignee: Ethan Rose (erose)
              Reporter: Ethan Rose (erose)
