Ambari / AMBARI-18786

HDP Upgrade fails when the cluster size is large


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.5.0
    • Component/s: ambari-server
    • Labels: None

    Description

      Starting with Ambari 2.4, HDP upgrades on large clusters fail during the NameNode restart.

      This happens because the restart command waits for the NameNode to leave safemode. On a large cluster the NameNode takes longer to leave safemode, and Ambari marks the action as failed because the NameNode did not leave safemode within the timeout configured in the Ambari scripts.

      Traceback (most recent call last):
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 42, in get_value_from_jmx
          return data_dict["beans"][0][property]
      IndexError: list index out of range
      Traceback (most recent call last):
        File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 420, in <module>
          NameNode().execute()
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 280, in execute
          method(env)
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 720, in restart
          self.start(env, upgrade_type=upgrade_type)
        File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
          upgrade_suspended=params.upgrade_suspended, env=env)
        File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
          return fn(*args, **kwargs)
        File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 184, in namenode
          if is_this_namenode_active() is False:
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 55, in wrapper
          return function(*args, **kwargs)
        File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 554, in is_this_namenode_active
          raise Fail(format("The NameNode {namenode_id} is not listed as Active or Standby, waiting..."))
      resource_management.core.exceptions.Fail: The NameNode nn1 is not listed as Active or Standby, waiting...
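
      The IndexError in the first traceback means the "beans" list in the JMX response was empty when it was indexed. As a rough illustration only, here is a minimal sketch of a lookup that tolerates an empty list. It assumes a response of the shape {"beans": [...]}; first_bean_value and the "State" property name are hypothetical and are not Ambari's get_value_from_jmx.

```python
def first_bean_value(data_dict, prop):
    """Return prop from the first JMX bean, or None if no bean is present.

    Hypothetical helper for illustration; it assumes data_dict is the
    parsed JSON of a JMX query, shaped like {"beans": [{...}, ...]}.
    """
    beans = data_dict.get("beans", [])
    if not beans:
        # An empty list here is what raises IndexError on beans[0].
        return None
    return beans[0].get(prop)


# Empty response: no exception, the caller sees None and can retry.
print(first_bean_value({"beans": []}, "State"))
# Populated response: the value is returned as-is.
print(first_bean_value({"beans": [{"State": "active"}]}, "State"))
```

      A caller would still need to treat None as "state not known yet" and keep waiting, so this does not remove the need for a long enough retry window.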
      

      To work around this, we increased the retry timeout in Ambari:

      1. Changed the retry decorator in /var/lib/ambari-server/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py from
      @retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail)
      to
      @retry(times=25, sleep_time=25, backoff_factor=2, err_class=Fail)

      2. Restarted the Ambari server.

      After this, the upgrade went through fine.

      I think it's better to increase the timeout permanently so that we don't have to deal with this issue again.
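
      To get a feel for how much waiting time the two decorator settings allow, the sketch below adds up the sleeps between attempts. It is an estimate under stated assumptions, not Ambari's retry implementation: it assumes times is the total number of attempts (so there are times - 1 sleeps) and that each sleep is multiplied by backoff_factor. The max_sleep_time cap is a hypothetical parameter added to show how a cap would change the total; check decorator.py in your Ambari version for the real semantics.

```python
def retry_sleep_budget(times, sleep_time, backoff_factor, max_sleep_time=None):
    """Return the list of sleeps (in seconds) between retry attempts.

    Assumptions (not taken from Ambari's source): `times` counts total
    attempts, and the sleep grows by `backoff_factor` after each failure
    unless the next value would exceed the optional `max_sleep_time`.
    """
    sleeps = []
    current = sleep_time
    for _ in range(max(times - 1, 0)):
        sleeps.append(current)
        grown = current * backoff_factor
        if max_sleep_time is None or grown <= max_sleep_time:
            current = grown
    return sleeps


# Original setting from the issue: four sleeps between five attempts.
old = retry_sleep_budget(times=5, sleep_time=5, backoff_factor=2)
print(old, sum(old))

# Changed setting, with an illustrative cap of 100 seconds per sleep.
new_capped = retry_sleep_budget(times=25, sleep_time=25, backoff_factor=2,
                                max_sleep_time=100)
print(len(new_capped), sum(new_capped))
```

      Under these assumptions the original setting gives up after about 75 seconds of sleeping, which is short for a NameNode on a large cluster. Without any cap, doubling 24 times from 25 seconds grows far beyond any practical wait, so whether and how the real decorator caps the sleep determines the effective timeout of the changed setting.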


          People

            Assignee: dmitriusan Dmitry Lysnichenko
            Reporter: dmitriusan Dmitry Lysnichenko
            Votes: 0
            Watchers: 4
