Hadoop YARN / YARN-11184

Fenced active RM not failing over correctly in HA setup

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.3
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

    Description

      We've observed an issue recently on a production cluster running 3.2.3 in which a fenced Resource Manager remains active, but does not communicate with the ZK state store, and therefore cannot function correctly. This did not occur while running 3.2.2 on the same cluster.

      In more detail, what seems to happen is: 

      1. The active RM gets a NodeExists error from ZK while storing an app in the state store. I suspect that this is caused by a transient connection issue: the first node-creation request succeeds, but its response never reaches the RM, which triggers a duplicate request that fails with this error.

      2. Because of this error, the active RM is fenced.

      3. Because it is fenced, the active RM starts to transition to standby.

      4. However, the RM never fully transitions to standby. It never logs "Transitioning RM to Standby mode" from the run method of StandByTransitionRunnable: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195. Related to this, a jstack of the RM shows that thread as RUNNABLE, but evidently not making progress:

      So the RM doesn't work because it is fenced, but remains active, which causes an outage until a failover is manually initiated.
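
      The suspected cause in step 1 (a create that succeeds on the server while its response is lost, followed by a retry that fails) can be illustrated with a small self-contained simulation. This is only a sketch of the hypothesis, not the actual state store or ZooKeeper client code: FlakyStore, ResponseLostException, NodeExistsException, storeWithRetry and the path /demo/app_1 are all hypothetical names invented for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for illustration; these are NOT ZooKeeper or YARN classes.
class NodeExistsException extends Exception {
    NodeExistsException(String path) { super("NodeExists for " + path); }
}

class ResponseLostException extends Exception {
    ResponseLostException(String path) { super("response lost for " + path); }
}

/** In-memory stand-in for the state store that drops the reply to the first create. */
class FlakyStore {
    private final Set<String> nodes = new HashSet<>();
    private boolean dropNextResponse = true;

    void create(String path) throws NodeExistsException, ResponseLostException {
        if (!nodes.add(path)) {
            // The node is already present, so a second create is rejected.
            throw new NodeExistsException(path);
        }
        if (dropNextResponse) {
            dropNextResponse = false;
            // The write WAS applied, but the caller never learns that.
            throw new ResponseLostException(path);
        }
    }

    boolean exists(String path) { return nodes.contains(path); }
}

class LostResponseDemo {
    /** Naive client: blindly repeats the create whenever the response was lost. */
    static String storeWithRetry(FlakyStore store, String path, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                store.create(path);
                return "STORED";
            } catch (ResponseLostException e) {
                // Outcome unknown from the client's point of view; try again.
            } catch (NodeExistsException e) {
                // The retry collides with the node written by the first attempt.
                return "NODE_EXISTS";
            }
        }
        return "GAVE_UP";
    }

    public static void main(String[] args) {
        FlakyStore store = new FlakyStore();
        System.out.println(storeWithRetry(store, "/demo/app_1", 2));
    }
}
```

      In this simulation the data is stored on the first attempt, yet the client still ends up with a NodeExists failure on the retry, which matches the sequence suspected in step 1. Whether this is what actually happened on the cluster is not confirmed by the report.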

          People

            Assignee: Unassigned
            Reporter: Steven Rand