[BIGTOP-2836] charm metric collector race condition - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.2.0, 1.2.1
Fix Version/s: 1.3.0
Component/s: deployment
Labels: None

Description

Although initially thought fixed in BIGTOP-2801, the charm metric collector can still cause a failed deployment. As a refresher, metrics give users the ability to see things such as how many datanodes or zookeeper peers are deployed in an environment.

The first attempt at fixing this was to add a precondition before collecting metrics, for example, ensuring the namenode is "ready" before running 'hdfs getconf'.

However, in this example, there can be a window in which the charm has told the namenode (NN) to start (at which point the "ready" state is set), yet the NN is still busy formatting HDFS. If the metric collector runs during this window, 'hdfs getconf' will fail, which means the metric hook fails, which in turn means the deployment fails.

There are a variety of ways to mitigate this:

1. Don't set "ready" until the NN is all the way up.
2. Don't let a metric hook fail the entire deployment.
3. Alter the collector so it handles a failed 'hdfs getconf' gracefully.

#1: added to our todo, but will take more time to implement.
#2: opened an issue against the metric layer to see if this is possible.

This JIRA will focus on fixing the problem with option #3.
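
As a rough illustration of option #3 (not the actual change in the linked pull request), a collector could wrap the command so that a failure yields a fallback value instead of a non-zero exit. The helper name `collect_metric`, the `default` fallback, and the 10 second timeout are all hypothetical, and this issue does not say which 'hdfs getconf' arguments the collector uses.

```python
import subprocess


def collect_metric(cmd, default=0, timeout=10):
    """Run cmd and return its stripped stdout, or default if it fails.

    Hypothetical helper: the name, the fallback value, and the timeout
    are assumptions for illustration, not taken from the charm.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=True,  # raise CalledProcessError on a non-zero exit
            universal_newlines=True,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit, or timeout: report the fallback
        # rather than letting the metric hook (and the deployment) fail.
        return default
    return result.stdout.strip()
```

For example, `collect_metric(["hdfs", "getconf", "-namenodes"])` would return the fallback while the NN is still formatting HDFS, and the real output once the command succeeds. Returning a fallback hides the error from the hook, so a collector taking this approach would probably also want to log the failure.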

Issue Links

links to

GitHub Pull Request #252

People

Assignee: Kevin Monroe

Reporter: Kevin Monroe

Dates

Created: 07/Jul/17 16:52

Updated: 07/Jul/17 22:05

Resolved: 07/Jul/17 22:05