Ambari / AMBARI-17236

Namenode start step failed during EU with RetriableException

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: ambari-server
    • Labels: None

      Description

      Steps

      1. Deploy HDP-2.3.4.0 cluster with Ambari 2.2.0.0 (secure, non-HA cluster with custom service users)
      2. Upgrade Ambari to 2.4.0.0-644
      3. Register HDP-2.4.2.0 and install the bits
      4. Start Express Upgrade

      The following error was observed during the NameNode start step:

      Traceback (most recent call last):
        File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 414, in <module>
          NameNode().execute()
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 257, in execute
          method(env)
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 679, in restart
          self.start(env, upgrade_type=upgrade_type)
        File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
          upgrade_suspended=params.upgrade_suspended, env=env)
        File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
          return fn(*args, **kwargs)
        File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 216, in namenode
          create_hdfs_directories()
        File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 283, in create_hdfs_directories
          mode=0777,
        File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
          self.env.run()
        File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
          self.run_action(resource, action)
        File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
          provider_action()
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 458, in action_create_on_execute
          self.action_delayed("create")
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 455, in action_delayed
          self.get_hdfs_resource_executor().action_delayed(action_name, self)
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 246, in action_delayed
          self._assert_valid()
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 230, in _assert_valid
          self.target_status = self._get_file_status(target)
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 291, in _get_file_status
          list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)
        File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 191, in run_command
          raise Fail(err_msg)
      resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://os-r6-gmcdns-dlm20todgm10sec-r6-5.openstacklocal:50070/webhdfs/v1/tmp?op=GETFILESTATUS&user.name=cstm-hdfs'' returned status_code=403. 
      {
        "RemoteException": {
          "exception": "RetriableException", 
          "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
          "message": "NameNode still not started"
        }
      }
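
      The 403 above is not a permission failure: WebHDFS wraps the server-side `RetriableException` in a `RemoteException` JSON body, signaling that the NameNode is still starting rather than that the request was unauthorized. A minimal sketch of how such a response could be classified (the helper name is hypothetical; Ambari's actual handling lives in `hdfs_resource.py`):

```python
import json

# Hypothetical helper: decide whether a failed WebHDFS call should be
# retried rather than treated as a hard failure. A 403 whose JSON body
# carries org.apache.hadoop.ipc.RetriableException means the NameNode
# is still starting up, not that the caller lacks permission.
def is_retriable_webhdfs_error(status_code, body):
    if status_code != 403:
        return False
    try:
        exc = json.loads(body).get("RemoteException", {})
    except ValueError:
        # Non-JSON body (e.g. an HTML error page): not retriable.
        return False
    return exc.get("exception") == "RetriableException"
```

      Applied to the body returned in this ticket, this would report the failure as retriable, so the caller could back off and poll instead of aborting the upgrade.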
      

      The heart of this issue is that, depending on topology and upgrade type, we may not wait for the NameNode to leave Safe Mode after starting it. However, we always create directories, regardless of topology or upgrade type:

          # Always run this on non-HA, or active NameNode during HA.
          if is_active_namenode:
            create_hdfs_directories()
            create_ranger_audit_hdfs_directories()
      

      A NameNode in Safe Mode is read-only and would refuse the directory creation anyway, even if it hadn't already thrown a `RetriableException` during startup:

      [hdfs@c6403 root]$ hadoop fs -mkdir /foo
      mkdir: Cannot create directory /foo. Name node is in safe mode.
      

      So it seems we need to wait for the NameNode to be out of Safe Mode no matter what.
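
      One way to implement that wait is to poll `hdfs dfsadmin -safemode get` (which prints "Safe mode is ON" / "Safe mode is OFF") until it reports OFF, before calling `create_hdfs_directories()`. A rough sketch under those assumptions; the helper name and parameters are illustrative, and the real fix is in the attached patch:

```python
import time

# Hypothetical sketch: poll Safe Mode status until it reports OFF or the
# timeout elapses. `get_status` stands in for running
# `hdfs dfsadmin -safemode get` and returning its stdout; it is injected
# here so the loop itself stays testable.
def wait_for_safemode_off(get_status, timeout=600, interval=10, sleep=time.sleep):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if "Safe mode is OFF" in get_status():
            return True
        sleep(interval)
    return False
```

      Run unconditionally before the directory-creation step, this would cover both the non-HA case and the active NameNode during HA, instead of relying on topology-specific paths to perform the wait.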

        Attachments

        1. AMBARI-17236.patch
          11 kB
          Jonathan Hurley

          People

          • Assignee: Jonathan Hurley (jonathan.hurley)
          • Reporter: Jonathan Hurley (jonathan.hurley)
          • Votes: 0
          • Watchers: 2
