  1. Ambari
  2. AMBARI-19435

NodeManager restart fails during Host Ordered Upgrade (HOU) if it is on the same host as the ResourceManager

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.5.0
    • Fix Version/s: 2.5.0
    • Component/s: ambari-server
    • Labels: None

    Description

      Steps

      1. Deploy a 4-node HDP-2.5.0.0 cluster with Ambari-2.5.0.0, with NodeManager (NM) installed on all hosts, NameNode (NN) HA enabled, and ResourceManager (RM) HA not enabled.
      2. Register the 2.5.3.0 version and install the bits.
      3. Start the HOU using the API and accept the manual prompts to sys-prep the hosts. Observe the wizard at the restart task for the host that runs both RM and NM.

      Result:
      The Restart NodeManager task on the RM host failed with the following output:

      2016-12-20 18:32:39,446 - File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action': ['delete'], 'not_if': 'ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'}
      2016-12-20 18:32:39,459 - Execute['ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/2.5.3.0-37/hadoop/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/2.5.3.0-37/hadoop/conf start nodemanager'] {'not_if': 'ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'user': 'yarn'}
      2016-12-20 18:32:40,558 - Execute['ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'not_if': 'ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid', 'tries': 5, 'try_sleep': 1}
      2016-12-20 18:32:40,576 - Skipping Execute['ambari-sudo.sh  -H -E test -f /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid && ambari-sudo.sh  -H -E pgrep -F /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] due to not_if
      2016-12-20 18:32:40,576 - Executing NodeManager Stack Upgrade post-restart
      2016-12-20 18:32:40,578 - NodeManager executing "yarn node -list -states=RUNNING" to verify the node has rejoined the cluster...
      2016-12-20 18:32:40,578 - checked_call['yarn node -list -states=RUNNING'] {'user': 'yarn'}
      
      Command failed after 1 tries
      

      A retry of the failed task is successful.
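
      Since a retry succeeds, one way to picture the behaviour is a check that polls a few times before giving up. The following is only a minimal sketch under stated assumptions, not Ambari's implementation and not the content of the attached patch: the names wait_for_node and list_running_nodes, the substring match on the hostname, and the default tries, try_sleep, and timeout values are all hypothetical. Only the yarn command itself is taken from the log above.

```python
import subprocess
import time


def list_running_nodes(timeout=30):
    """Run the check from the log above and return its stdout (hypothetical helper)."""
    result = subprocess.run(
        ["yarn", "node", "-list", "-states=RUNNING"],
        capture_output=True, text=True, check=True, timeout=timeout,
    )
    return result.stdout


def wait_for_node(hostname, run_check=list_running_nodes,
                  tries=5, try_sleep=1.0, sleep=time.sleep):
    """Return True once hostname appears in the check output, False if tries run out."""
    for attempt in range(1, tries + 1):
        try:
            # Assumption: a plain substring match is enough to spot the node.
            if hostname in run_check():
                return True
        except (subprocess.SubprocessError, OSError):
            # The command failed or timed out, e.g. because the RM is still down.
            pass
        if attempt < tries:
            sleep(try_sleep)
    return False
```

      Passing run_check and sleep as parameters keeps the sketch testable without a live cluster; a caller would normally rely on the defaults.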

      The likely cause is that the RM is still down when we try to start the NM on the same host. After starting the NM, we run the command below to verify that the NM has come up:

      yarn node -list -states=RUNNING
      

      The command fails because it tries to connect to the RM, which is still down, and the connection times out.
      As a possible fix, the ordering in the HOU upgrade pack may need to be adjusted so that the RM is started before the NM in such cases.
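
      The proposed reordering can be illustrated on a plain list of component names for one host. This is a hedged sketch only: the function start_rm_before_nm and the list representation are hypothetical, and it does not reflect the upgrade pack format or what the attached patch actually changes.

```python
def start_rm_before_nm(components):
    """Return a copy of components with RESOURCEMANAGER moved ahead of NODEMANAGER.

    Hypothetical helper: all other components keep their relative order,
    and the input list is left unmodified.
    """
    ordered = list(components)
    if "RESOURCEMANAGER" in ordered and "NODEMANAGER" in ordered:
        rm = ordered.index("RESOURCEMANAGER")
        nm = ordered.index("NODEMANAGER")
        if rm > nm:
            # Move the RM to the slot just before the NM.
            ordered.insert(nm, ordered.pop(rm))
    return ordered
```

      For a host whose restart list is NODEMANAGER, DATANODE, RESOURCEMANAGER, the sketch yields RESOURCEMANAGER, NODEMANAGER, DATANODE; hosts without an RM are returned unchanged.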

      Attachments

        1. AMBARI-19435.patch
          56 kB
          Jonathan Hurley

            People

              Assignee: Jonathan Hurley (jonathanhurley)
              Reporter: Vivek Sharma (shavi71)