  1. Hadoop YARN
  2. YARN-3364

Clarify Naming of yarn.client.nodemanager-connect.max-wait-ms and yarn.resourcemanager.connect.max-wait.ms


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • None
    • None
    • yarn
    • None

    Description

      I recently encountered an issue where the ApplicationMaster for MapReduce jobs would spend hours attempting to connect to a node in my cluster that had died due to a hardware fault. After debugging this, I found that the yarn.client.nodemanager-connect.max-wait-ms property did not behave as I had expected. Based on the name, I had thought it would set a maximum time limit for attempting to connect to a NodeManager. The code in org.apache.hadoop.yarn.client.NMProxy seemed to corroborate this: it uses a RetryUpToMaximumTimeWithFixedSleep policy when a ConnectTimeoutException is thrown, as it was in my case with a dead node.

      However, the RetryUpToMaximumTimeWithFixedSleep policy doesn't actually enforce a time limit. Instead, it divides the maximum time by the sleep period to compute a total number of retries, regardless of how long each retry takes. As a result, the ApplicationMaster spent much longer attempting to connect than I had anticipated.
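
The arithmetic described above can be illustrated with a small standalone sketch. It does not use the Hadoop classes; the class name RetryBudgetSketch, its methods, and all numeric values are hypothetical. It assumes a simplified model in which each retry costs one failed connection attempt plus one fixed sleep.

```java
// Hypothetical, simplified model of a "maximum time" retry policy that is
// really a retry-count policy. Not the Hadoop implementation.
final class RetryBudgetSketch {

    // The retry count is derived by dividing the configured maximum wait
    // by the fixed sleep between retries.
    static long maxRetries(long maxWaitMs, long sleepMs) {
        return maxWaitMs / sleepMs;
    }

    // Approximate elapsed time if every retry blocks for perAttemptMs
    // (for example, a connect timeout) and is followed by one sleep.
    static long approxElapsedMs(long maxWaitMs, long sleepMs, long perAttemptMs) {
        return maxRetries(maxWaitMs, sleepMs) * (perAttemptMs + sleepMs);
    }

    public static void main(String[] args) {
        long maxWaitMs = 900_000L;   // illustrative "max wait" of 15 minutes
        long sleepMs = 10_000L;      // illustrative 10 second sleep
        long perAttemptMs = 60_000L; // illustrative time for one failed attempt

        System.out.println("retries = " + maxRetries(maxWaitMs, sleepMs));
        System.out.println("approx elapsed ms = "
                + approxElapsedMs(maxWaitMs, sleepMs, perAttemptMs));
    }
}
```

With these illustrative numbers the nominal 15 minute limit becomes 90 retries, and the modeled elapsed time is 6,300,000 ms (105 minutes). The longer each failed attempt takes, the further the real wait drifts from the configured value.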

      The yarn.resourcemanager.connect.max-wait.ms property has the same behavior. These properties would be better named something like yarn.client.nodemanager-connect.max.retries and yarn.resourcemanager.connect.max.retries, to align with what they actually do.


          People

            Assignee: Unassigned
            Reporter: Andrew Johnson (ajsquared)
            Votes: 0
            Watchers: 2
