Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: deployment
    • Labels: None

      Description

      Cloud Weather Report @ http://cwr.vapour.ws/bundle_hadoop_processing/dfd2c76b5e7e49be94f09b19bfd07aa7/report.html has shown that on GCE there is a hook failure on the slave application in the hadoop-processing bundle:

      2016-11-01 15:46:07 [DEBUG] deployer.env: Delta application: slave change:{u'current': u'error', u'message': u'hook failed: "namenode-relation-changed"', u'since': u'2016-11-01T15:46:06.083029127Z', u'data': {u'hook': u'namenode-relation-changed', u'remote-unit': u'namenode/0', u'relation-id': 1}, u'version': u''}
      2016-11-01 15:46:07 [DEBUG] deployer.env: Delta unit: slave/0 change:{u'current': u'error', u'message': u'hook failed: "namenode-relation-changed"', u'since': u'2016-11-01T15:46:06.083029127Z', u'data': {u'hook': u'namenode-relation-changed', u'remote-unit': u'namenode/0', u'relation-id': 1}, u'version': u''}
      2016-11-01 15:46:07 [ERROR] deployer.env: The following units had errors:
        unit: slave/0: machine: 1 agent-state: error details: hook failed: "namenode-relation-changed"

      Attachments

      1. cwr-gce--unit-slave-2.log.txt (341 kB, attached by Kevin W Monroe)
      2. cwr-gce--unit-namenode-0.log.txt (285 kB, attached by Kevin W Monroe)


          Activity

          kwmonroe Kevin W Monroe added a comment -

          Just a quick note here, this isn't a "hadoop" failure, but rather a failure in the bundle deployment. Hence moving to "deployment" component.

          Also worth mentioning, it's sporadic. Successful GCE run of hadoop-processing this morning:

          http://data.vapour.ws/cwr-tests/results/bundle_hadoop_processing/4921b489a8ca4a08850c91deb704898f/report.html

          kwmonroe Kevin W Monroe added a comment -

          Jenkins machine log for slave/2 shows it timed out waiting for HDFS to become ready.

          kwmonroe Kevin W Monroe added a comment -

          Jenkins machine log for namenode/0 shows:

          2016-11-01 15:41:50 INFO install Error: Could not start Service[hadoop-hdfs-namenode]: Execution of '/usr/sbin/service hadoop-hdfs-namenode start' returned 1: Job for hadoop-hdfs-namenode.service failed because the control process exited with error code. See "systemctl status hadoop-hdfs-namenode.service" and "journalctl -xe" for details.

          I don't have more info on why hadoop failed to start, but one possible workaround would be for the charm to detect this failure and retry the systemctl start. I'll keep investigating.
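
          For reference, a minimal sketch of what such a retry could look like in the charm code; the helper name, retry count, and delay are illustrative assumptions, not part of the actual charm:

              import time

              from charmhelpers.core import hookenv, host


              def restart_with_retries(service, attempts=3, delay=10):
                  # Try to (re)start the given service a few times before giving up.
                  # host.service_restart() returns True when the restart succeeded.
                  for attempt in range(1, attempts + 1):
                      if host.service_restart(service):
                          return True
                      hookenv.log('%s failed to start (attempt %d/%d); retrying in %ds'
                                  % (service, attempt, attempts, delay))
                      time.sleep(delay)
                  return False

          A handler could then call restart_with_retries('hadoop-hdfs-namenode') and only proceed (set states, open ports) when it returns True.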

          githubbot ASF GitHub Bot added a comment -

          GitHub user kwmonroe opened a pull request:

          https://github.com/apache/bigtop/pull/162

          BIGTOP-2570: ensure bigtop services are started

          We need to make sure our bigtop services are actually started before setting the `.started` state.

          Without this fix, our NN will set `.started` even if `host.service_restart('hadoop-hdfs-namenode')` fails. This is bad because our slave/datanode units will attempt to do things (like relate to the NN) when they can't.

          This doesn't help us know why the service failed to start, but it does prevent other charms from relating to an unready application.
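
          As an illustration of the same pattern on the namenode side, a rough sketch (the handler shape and the 'apache-bigtop-namenode.started' state name are assumptions modeled on the resourcemanager diff quoted below, not a copy of the actual patch):

              from charmhelpers.core import hookenv, host
              from charms.reactive import remove_state, set_state


              def start_namenode():
                  # Only advertise .started when the service actually came up, so that
                  # slave/datanode units do not try to relate to a NameNode that is down.
                  if host.service_restart('hadoop-hdfs-namenode'):
                      set_state('apache-bigtop-namenode.started')
                      hookenv.status_set('maintenance', 'namenode started')
                  else:
                      hookenv.log('HDFS NameNode failed to start')
                      hookenv.status_set('blocked', 'namenode failed to start')
                      remove_state('apache-bigtop-namenode.started')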

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/juju-solutions/bigtop bug/BIGTOP-2570/tweak-service-started

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/bigtop/pull/162.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #162


          commit 70f4eb2ae7b61da4014270016dbf854e76f44405
          Author: Kevin W Monroe <kevin.monroe@canonical.com>
          Date: 2016-11-17T00:08:27Z

          do not set NN and RM .started states unless we know the service is started


          kwmonroe Kevin W Monroe added a comment -

          Linked PR is ready for review.

          Hadoop NN, RM, Slave, and Plugin charms have been built and pushed to ~bigdata-dev, so the bundle-dev.yaml from any of the hadoop-* bundles can be used to test these.
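
          For example, assuming a local copy of one of the hadoop-* bundles, the dev bundle can be deployed with something like:

          $ juju deploy ./bundle-dev.yaml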

          githubbot ASF GitHub Bot added a comment -

          Github user kwmonroe commented on the issue:

          https://github.com/apache/bigtop/pull/162

          Though this started with a systemctl service that failed to start, it has turned into a crusade to make charm actions/status/logging better so we can more easily debug problems like this in the future.

          In addition to handling `start_foo` better, I added some additional logging, corrected some bad status, and updated actions to better log their output. I've updated the JIRA/PR title to reflect this extra scope.
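
          A rough sketch of the "actions log their output" part (the action name, command, and result key are placeholders, not the actual charm actions):

              from subprocess import CalledProcessError, check_output

              from charmhelpers.core import hookenv


              def smoke_test_action():
                  # Run a placeholder smoke-test command and surface its output in the
                  # action results rather than discarding it.
                  try:
                      report = check_output(['hdfs', 'dfsadmin', '-report']).decode('utf-8')
                      hookenv.action_set({'output': report})
                  except CalledProcessError as e:
                      hookenv.log('smoke test failed: %s' % e)
                      hookenv.action_fail('smoke test failed; see juju debug-log for details')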

          githubbot ASF GitHub Bot added a comment -

          Github user ktsakalozos commented on a diff in the pull request:

          https://github.com/apache/bigtop/pull/162#discussion_r88909017

          --- Diff: bigtop-packages/src/charm/hadoop/layer-hadoop-resourcemanager/reactive/resourcemanager.py ---
          @@ -131,16 +131,28 @@ def send_nn_spec(namenode):
           @when_not('apache-bigtop-resourcemanager.started')
           def start_resourcemanager(namenode):
               hookenv.status_set('maintenance', 'starting resourcemanager')
          -    # NB: service should be started by install, but this may be handy in case
          -    # we have something that removes the .started state in the future. Also
          -    # note we restart here in case we modify conf between install and now.
          -    host.service_restart('hadoop-yarn-resourcemanager')
          -    host.service_restart('hadoop-mapreduce-historyserver')
          -    for port in get_layer_opts().exposed_ports('resourcemanager'):
          -        hookenv.open_port(port)
          -    set_state('apache-bigtop-resourcemanager.started')
          -    hookenv.application_version_set(get_hadoop_version())
          -    hookenv.status_set('maintenance', 'resourcemanager started')
          +    # NB: service should be started by install, but we want to verify it is
          +    # running before we set the .started state and open ports. We always
          +    # restart here, which may seem heavy-handed. However, restart works
          +    # whether the service is currently started or stopped. It also ensures the
          +    # service is using the most current config.
          +    rm_started = host.service_restart('hadoop-yarn-resourcemanager')
          +    if rm_started:
          +        for port in get_layer_opts().exposed_ports('resourcemanager'):
          +            hookenv.open_port(port)
          +        set_state('apache-bigtop-resourcemanager.started')
          +        hookenv.status_set('maintenance', 'resourcemanager started')
          +        hookenv.application_version_set(get_hadoop_version())
          +    else:
          +        hookenv.log('YARN ResourceManager failed to start')
          +        hookenv.status_set('blocked', 'resourcemanager failed to start')
          +        remove_state('apache-bigtop-resourcemanager.started')
          --- End diff ---

          I guess this is "just in case"

          githubbot ASF GitHub Bot added a comment -

          Github user ktsakalozos commented on the issue:

          https://github.com/apache/bigtop/pull/162

          LGTM +1

          githubbot ASF GitHub Bot added a comment -

          Github user johnsca commented on a diff in the pull request:

          https://github.com/apache/bigtop/pull/162#discussion_r88919639

          --- Diff: bigtop-packages/src/charm/hadoop/layer-hadoop-resourcemanager/reactive/resourcemanager.py ---
          @@ -131,16 +131,28 @@ def send_nn_spec(namenode):
           @when_not('apache-bigtop-resourcemanager.started')
           def start_resourcemanager(namenode):
               hookenv.status_set('maintenance', 'starting resourcemanager')
          -    # NB: service should be started by install, but this may be handy in case
          -    # we have something that removes the .started state in the future. Also
          -    # note we restart here in case we modify conf between install and now.
          -    host.service_restart('hadoop-yarn-resourcemanager')
          -    host.service_restart('hadoop-mapreduce-historyserver')
          -    for port in get_layer_opts().exposed_ports('resourcemanager'):
          -        hookenv.open_port(port)
          -    set_state('apache-bigtop-resourcemanager.started')
          -    hookenv.application_version_set(get_hadoop_version())
          -    hookenv.status_set('maintenance', 'resourcemanager started')
          +    # NB: service should be started by install, but we want to verify it is
          +    # running before we set the .started state and open ports. We always
          +    # restart here, which may seem heavy-handed. However, restart works
          +    # whether the service is currently started or stopped. It also ensures the
          +    # service is using the most current config.
          +    rm_started = host.service_restart('hadoop-yarn-resourcemanager')
          +    if rm_started:
          +        for port in get_layer_opts().exposed_ports('resourcemanager'):
          +            hookenv.open_port(port)
          +        set_state('apache-bigtop-resourcemanager.started')
          +        hookenv.status_set('maintenance', 'resourcemanager started')
          +        hookenv.application_version_set(get_hadoop_version())
          +    else:
          +        hookenv.log('YARN ResourceManager failed to start')
          +        hookenv.status_set('blocked', 'resourcemanager failed to start')
          +        remove_state('apache-bigtop-resourcemanager.started')
          --- End diff ---

          No, this will prevent related services from trying to use the RM before it is started.

          githubbot ASF GitHub Bot added a comment -

          Github user johnsca commented on the issue:

          https://github.com/apache/bigtop/pull/162

          LGTM as well :+1:

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/bigtop/pull/162

          kwmonroe Kevin W Monroe added a comment -

          Core hadoop charms (NN, RM, Slave, and plugin) have been rebuilt and pushed to the charmstore. Hadoop bundles will pick these up for the next run:

          http://data.vapour.ws/cwr-tests/results/index.html


            People

            • Assignee: kwmonroe Kevin W Monroe
            • Reporter: arosales Antonio Rosales
            • Votes: 0
            • Watchers: 3
