Hadoop YARN / YARN-3266

RMContext inactiveNodes should have NodeId as map key


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: resourcemanager
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Under the default NM port configuration, which is 0, we have observed in the current version that the "lost nodes" count can be greater than the length of the lost-node list. This happens when we consecutively restart the same NM twice (a concrete sketch follows the list):

      • NM started at port 10001
      • NM restarted at port 10002
      • NM restarted at port 10003
      • NM:10001 times out: ClusterMetrics#incrNumLostNMs() brings # lost nodes to 1; rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode) leaves inactiveNodes with 1 element
      • NM:10002 times out: ClusterMetrics#incrNumLostNMs() brings # lost nodes to 2; rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode) overwrites the existing entry for that host, so inactiveNodes still has 1 element
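
      To make the mismatch concrete, here is a minimal, self-contained sketch of the current behavior. It uses hypothetical stand-ins (SimpleNodeId, a plain counter, and a host-keyed map), not the real NodeId, ClusterMetrics, and RMContext classes:

          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.ConcurrentMap;
          import java.util.concurrent.atomic.AtomicInteger;

          public class LostNodeCountDemo {
              // Hypothetical stand-in for org.apache.hadoop.yarn.api.records.NodeId.
              record SimpleNodeId(String host, int port) {}

              public static void main(String[] args) {
                  // Stand-in for the ClusterMetrics lost-NM counter.
                  AtomicInteger numLostNMs = new AtomicInteger();
                  // Current shape: inactive nodes keyed by host only.
                  ConcurrentMap<String, SimpleNodeId> inactiveNodes = new ConcurrentHashMap<>();

                  SimpleNodeId nm10001 = new SimpleNodeId("host1", 10001);
                  SimpleNodeId nm10002 = new SimpleNodeId("host1", 10002);

                  // NM:10001 times out: metric goes to 1, map gets one entry for "host1".
                  numLostNMs.incrementAndGet();
                  inactiveNodes.put(nm10001.host(), nm10001);

                  // NM:10002 times out: metric goes to 2, but the put overwrites the "host1" entry.
                  numLostNMs.incrementAndGet();
                  inactiveNodes.put(nm10002.host(), nm10002);

                  System.out.println("lost NMs metric    = " + numLostNMs.get());      // prints 2
                  System.out.println("inactiveNodes size = " + inactiveNodes.size());  // prints 1
              }
          }

      The metric is incremented once per lost NM instance, while the host-keyed put collapses both instances into a single map entry.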

      Since we allow multiple NodeManagers on one host (as discussed in YARN-1888), inactiveNodes should be of type ConcurrentMap<NodeId, RMNode>. If that would break the current API, then the key string should at least include the NM's port as well.
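
      For comparison, a sketch of the proposed keying, again with a hypothetical SimpleNodeId stand-in (and a String standing in for RMNode) rather than the actual patch:

          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.ConcurrentMap;

          public class InactiveNodesByNodeIdDemo {
              // Hypothetical stand-in for NodeId; the record's equals/hashCode cover host AND port.
              record SimpleNodeId(String host, int port) {}

              public static void main(String[] args) {
                  // Proposed shape: ConcurrentMap<NodeId, RMNode>; a String stands in for RMNode here.
                  ConcurrentMap<SimpleNodeId, String> inactiveNodes = new ConcurrentHashMap<>();

                  // Two lost NM instances on the same host stay distinct because the port is part of the key.
                  inactiveNodes.put(new SimpleNodeId("host1", 10001), "rmNode lost at port 10001");
                  inactiveNodes.put(new SimpleNodeId("host1", 10002), "rmNode lost at port 10002");

                  System.out.println(inactiveNodes.size());  // prints 2, matching the lost-node metric
              }
          }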

      Thoughts?

      Attachments

        1. YARN-3266.01.patch (5 kB, Chengbing Liu)
        2. YARN-3266.02.patch (9 kB, Chengbing Liu)
        3. YARN-3266.03.patch (15 kB, Chengbing Liu)

        Issue Links

        Activity


          People

            Assignee: Chengbing Liu (chengbing.liu)
            Reporter: Chengbing Liu (chengbing.liu)
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved:
