Bigtop / BIGTOP-2836

charm metric collector race condition

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.2.1
    • Fix Version/s: 1.3.0
    • Component/s: deployment
    • Labels: None

    Description

      Although this was initially thought to be fixed in BIGTOP-2801, it seems the charm metric collector can still cause a failed deployment. As a refresher, metrics give users the ability to see things like how many datanodes or zookeeper peers are deployed in an environment.

      The first attempt at fixing this was to add a precondition before collecting metrics, for example, ensuring the namenode (NN) is "ready" before running 'hdfs getconf'.

      However, in this example, there can be a period of time where the charm tells the NN to start (at which point the "ready" state is set), yet the NN takes a while to format HDFS. If the metric collector runs during this time, 'hdfs getconf' will fail, which means the metric hook fails, which means the deployment fails.

      There are a variety of ways to mitigate this:

      1. Don't set "ready" until the NN is all the way up.
      2. Don't let a metric hook fail the entire deployment.
      3. Alter the collector so it handles a failed 'hdfs getconf' gracefully.

      #1: added to our todo, but will take more time to implement.
      #2: opened an issue against the metric layer to see if this is possible.

      This JIRA will focus on fixing the problem with option #3.
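
      As a rough illustration of option #3, the sketch below wraps a metric command so that a non-zero exit, a missing binary, or a timeout yields "no value" instead of an error. It is not the change made for this issue: the function name collect_metric, the timeout default, and the idea of calling it from Python are all assumptions made for the example.

```python
import subprocess


def collect_metric(cmd, timeout=30):
    """Run a metric command and return its stripped stdout.

    Hypothetical helper: returns None instead of raising when the
    command cannot be run, times out, or exits non-zero, so the caller
    can skip the metric for this collection cycle.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing or not executable, or the command hung
        # (e.g. the NN is still formatting HDFS).
        return None
    if result.returncode != 0:
        # Command ran but failed; report no value rather than an error.
        return None
    return result.stdout.strip()


# Illustrative call; the '-namenodes' argument is only an example.
# value = collect_metric(["hdfs", "getconf", "-namenodes"])
# if value is None: skip this metric until the next collection run.
```

      Returning None rather than a default such as 0 keeps a "not ready yet" reading distinct from a genuine zero count; whether the metric layer accepts a skipped value is a separate question (see option #2).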

            People

              kwmonroe Kevin Monroe
              kwmonroe Kevin Monroe