Ambari / AMBARI-18262

When Enabling NameNode HA Via the UI Wizard, the Second NN Fails to Start

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: trunk, 2.4.1
    • Component/s: ambari-server
    • Labels: None

    Description

      Caused by: AMBARI-18240

      In the Enable NameNode HA wizard, the failure occurs at the "Start Additional NameNode" step.

      The first NameNode starts...

       "href" : "https://172.22.115.113:8443/api/v1/clusters/cl1/requests/46/tasks/368",
        "Tasks" : {
          "attempt_cnt" : 1,
          "cluster_name" : "cl1",
          "command" : "START",
          "command_detail" : "NAMENODE START",
          "end_time" : 1472080011602,
          "error_log" : "/var/lib/ambari-agent/data/errors-368.txt",
          "exit_code" : 0,
          "host_name" : "nat-sp12-rnqs-amb-views-ha-6-5.openstacklocal",
          "id" : 368,
          "output_log" : "/var/lib/ambari-agent/data/output-368.txt",
          "request_id" : 46,
          "role" : "NAMENODE",
          "stage_id" : 0,
          "start_time" : 1472079963470,
          "status" : "COMPLETED",
          "stderr" : "2016-08-24 23:06:11,102 - Getting jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-5.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback (most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\", line 42, in get_value_from_jmx\n    return data_dict[\"beans\"][0][property]\nIndexError: list index out of range\n2016-08-24 23:06:14,332 - Getting jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback (most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\", line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\", line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate -u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmprdewEy 2>/tmp/tmpAmLket' returned 7. \n\n2016-08-24 23:06:22,280 - Getting jmx metrics from NN failed. 
URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback (most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\", line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\", line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate -u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmpHKH50b 2>/tmp/tmp6yyuWH' returned 7. \n\n2016-08-24 23:06:30,637 - Getting jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback (most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\", line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\", line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate -u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmpCXMjfH 2>/tmp/tmpq103ei' returned 7. \n\n2016-08-24 23:06:39,495 - Getting jmx metrics from NN failed. 
URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback (most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\", line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\", line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate -u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmpvdE9iJ 2>/tmp/tmpy9eAby' returned 7. \n\n2016-08-24 23:06:47,584 - Getting jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback (most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\", line 38, in get_value_from_jmx\n    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\", line 61, in get_user_call_output\n    raise Fail(err_msg)\nFail: Execution of 'curl --negotiate -u : -s 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmp0Jx91E 2>/tmp/tmp6qu0gW' returned 7.",
      

      The second does not:

      {
        "href" : "https://172.22.115.113:8443/api/v1/clusters/cl1/requests/47/tasks/369",
        "Tasks" : {
          "attempt_cnt" : 1,
          "cluster_name" : "cl1",
          "command" : "START",
          "command_detail" : "NAMENODE START",
          "end_time" : 1472080160611,
          "error_log" : "/var/lib/ambari-agent/data/errors-369.txt",
          "exit_code" : 1,
          "host_name" : "nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal",
          "id" : 369,
          "output_log" : "/var/lib/ambari-agent/data/output-369.txt",
          "request_id" : 47,
          "role" : "NAMENODE",
          "stage_id" : 0,
          "start_time" : 1472080026015,
          "status" : "FAILED",
          "stderr" : "2016-08-24 23:07:13,642 - Getting jmx metrics from NN failed. URL: http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback (most recent call last):\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\", line 42, in get_value_from_jmx\n    return data_dict[\"beans\"][0][property]\nIndexError: list index out of range\nTraceback (most recent call last):\n  File \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py\", line 420, in <module>\n    NameNode().execute()\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py\", line 280, in execute\n    method(env)\n  File \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py\", line 101, in start\n    upgrade_suspended=params.upgrade_suspended, env=env)\n  File \"/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py\", line 89, in thunk\n    return fn(*args, **kwargs)\n  File \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py\", line 184, in namenode\n    if is_this_namenode_active() is False:\n  File \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py\", line 55, in wrapper\n    return function(*args, **kwargs)\n  File \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py\", line 549, in is_this_namenode_active\n    raise Fail(format(\"The NameNode {namenode_id} is not listed as Active or Standby, waiting...\"))\nresource_management.core.exceptions.Fail: The NameNode nn2 is not listed as Active or Standby, waiting...",
      
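      Both stderr excerpts above include an IndexError raised from data_dict["beans"][0][property], which means the JMX response parsed correctly but its "beans" list was empty. A minimal sketch of a lookup that tolerates that case is shown below. The function name and the None return convention are assumptions made for illustration; this is not the code in Ambari's jmx.py.

```python
import json


def get_value_from_jmx_payload(payload, prop):
    """Return prop from the first JMX bean, or None if it is unavailable.

    Hypothetical helper: 'payload' is the raw JSON text returned by the
    NameNode /jmx endpoint, 'prop' is the bean attribute to read.
    """
    try:
        data = json.loads(payload)
    except ValueError:
        # The body was not valid JSON (for example, an empty response).
        return None

    if not isinstance(data, dict):
        return None

    beans = data.get("beans") or []
    if not beans:
        # This is the situation that produced "list index out of range".
        return None

    return beans[0].get(prop)
```

      With a helper like this, a NameNode that is still starting up would simply report no value, and the caller could decide whether to retry.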

      When the UI enables NN HA, it first starts NN1 and then NN2. At this stage both NNs are in 'standby' mode. The active node is elected only later (I believe when ZKFC is installed and started), so I think the second NN start should not fail when no active NameNode is found:

      1st NN start:

      nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal
      2016-08-24 23:08:20,037 - NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn1', 'nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal:50070')], unknown_namenodes = [(u'nn2', 'nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal:50070')]
      2016-08-24 23:08:20,037 - No active NameNode was found after 5 retries. Will return current NameNode HA states
      2016-08-24 23:08:20,037 - Skipping Safemode check due to the following conditions: HA: True, isActive: False, upgradeType: None
      2016-08-24 23:08:20,037 - Skipping creation of HDFS directories since this is either not the Active NameNode or we did not wait for Safemode to finish.
      
      Command completed successfully!
      

      2nd NN start:

      nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal
      2016-08-24 23:10:51,011 - NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn1', 'nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal:50070'), (u'nn2', 'nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal:50070')], unknown_namenodes = []
      2016-08-24 23:10:51,012 - No active NameNode was found after 5 retries. Will return current NameNode HA states
      
      Command failed after 1 tries
      

      Since the 2nd NN start failed, the wizard does not continue with installing ZKFC and the rest of the steps.
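      The behavior suggested in this description can be sketched roughly as follows: a NameNode start is treated as successful once that NameNode reports itself as active or standby, even if no active NameNode exists yet. The function name and signature are hypothetical, the tuples mirror the shape printed in the "NameNode HA states" log lines, and this is not the code from the attached patch.

```python
def should_fail_namenode_start(namenode_id, active, standby, unknown):
    """Decide whether a NameNode START command should be reported as failed.

    Hypothetical sketch. 'active', 'standby' and 'unknown' are lists of
    (namenode_id, address) tuples, as in the "NameNode HA states" log line.
    """
    active_ids = set(nn_id for nn_id, _ in active)
    standby_ids = set(nn_id for nn_id, _ in standby)

    # The NameNode is up if it reports any HA state, active or standby.
    # An empty 'active' list alone is not a failure: during the HA wizard
    # the active node is only elected later.
    if namenode_id in active_ids or namenode_id in standby_ids:
        return False

    # Still unknown (or not listed at all): the process did not come up.
    return True
```

      Applied to the states logged for the 2nd NN start above (no active NameNode, nn1 and nn2 both standby), this returns False for nn2, so the wizard would be able to proceed to the ZKFC steps.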

      Attachments

        1. AMBARI-18262.patch
          4 kB
          Jonathan Hurley

            People

              Assignee: Jonathan Hurley
              Reporter: Jonathan Hurley