  1. Ambari
  2. AMBARI-17182

App timeline Server start fails on enabling HA because namenode is in safemode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: None
    • Labels:
    • Flags:
      Patch

      Description

      On the last step ("Start all") of enabling NameNode HA, the App Timeline Server start fails with the following error:

      Traceback (most recent call last):
        File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 147, in <module>
          ApplicationTimelineServer().execute()
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
          method(env)
        File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 43, in start
          self.configure(env) # FOR SECURITY
        File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 54, in configure
          yarn(name='apptimelineserver')
        File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
          return fn(*args, **kwargs)
        File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py", line 276, in yarn
          mode=0755
        File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
          self.env.run()
        File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
          self.run_action(resource, action)
        File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
          provider_action()
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 463, in action_create_on_execute
          self.action_delayed("create")
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 460, in action_delayed
          self.get_hdfs_resource_executor().action_delayed(action_name, self)
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 259, in action_delayed
          self._set_mode(self.target_status)
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 366, in _set_mode
          self.util.run_command(self.main_resource.resource.target, 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 195, in run_command
          raise Fail(err_msg)
      resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT 'http://os-s11-3-iavzl-nat-s-ru242to25susesecha-12.openstacklocal:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755'' returned status_code=403. 
      {
        "RemoteException": {
          "exception": "RetriableException", 
          "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
          "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total blocks 697.\nThe number of live datanodes 20 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached."
        }
      }
      

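      This happens because the NameNode (NN) has not yet left safe mode when the App Timeline Server (ATS) starts: the DataNodes (DNs) have only just started, so the number of reported blocks has not yet reached the safe mode threshold.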
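
      To make the failure mode concrete, here is a minimal sketch of how a caller could recognize this specific rejection in a WebHDFS error body. The function name is_safemode_error is hypothetical and is not part of Ambari's hdfs_resource.py; the sketch only assumes the JSON shape shown in the error above.

```python
import json


def is_safemode_error(response_body):
    """Return True if a WebHDFS error body reports a safe mode rejection.

    Hypothetical helper: it only inspects the "RemoteException" object
    in the shape shown in the error output above.
    """
    try:
        payload = json.loads(response_body)
    except ValueError:
        # Not JSON at all, so it cannot be the structured error we look for.
        return False
    if not isinstance(payload, dict):
        return False
    remote = payload.get("RemoteException")
    if not isinstance(remote, dict):
        return False
    # The outer exception is a RetriableException; the SafeModeException
    # class name only appears inside the message text.
    message = remote.get("message") or ""
    return "SafeModeException" in message
```

      A 403 response whose body matches this check is retriable once the NameNode leaves safe mode, unlike a genuine permission error.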

      To fix this, "Stop NameNodes" has to be triggered before "Start all".

      With that in place, "Start all" ensures that the DataNodes start before the NameNodes, and that the NameNodes are out of safe mode before ATS starts.
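
      The condition that must hold before ATS starts can be sketched as a polling loop around the "hdfs dfsadmin -safemode get" command. This is an illustration only and is not the content of nnha_fix.patch, which changes the ordering of the HA wizard steps instead. The function names, retry count and delay are assumptions, and the parsing assumes the command prints one "Safe mode is ON/OFF" line per NameNode.

```python
import subprocess
import time


def safemode_is_off(dfsadmin_output):
    """Return True only if every reported NameNode says safe mode is OFF."""
    lines = [line for line in dfsadmin_output.splitlines()
             if "Safe mode is" in line]
    # No status line at all is treated as "not confirmed off".
    return bool(lines) and all("Safe mode is OFF" in line for line in lines)


def run_dfsadmin_safemode_get():
    """Run the HDFS CLI and return its output as text."""
    raw = subprocess.check_output(["hdfs", "dfsadmin", "-safemode", "get"])
    return raw.decode("utf-8", "replace")


def wait_for_safemode_off(get_status=run_dfsadmin_safemode_get,
                          retries=30, delay_seconds=10, sleep=time.sleep):
    """Poll until safe mode is off; return False if retries run out.

    get_status and sleep are injectable so the loop can be exercised
    without a running cluster.
    """
    for attempt in range(retries):
        if safemode_is_off(get_status()):
            return True
        if attempt < retries - 1:
            sleep(delay_seconds)
    return False
```

      On a live cluster, "hdfs dfsadmin -safemode wait" blocks until safe mode is off and may be simpler than polling; the loop above just makes the retry budget explicit.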

        Attachments

        1. nnha_fix.patch
          2 kB
          Victor Galgo

              People

              • Assignee: Victor Galgo (vgalgo)
              • Reporter: Victor Galgo (vgalgo)