Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
Description
@zliu41 @sahilTakiar @liyinan926 As far as I can tell, there are major issues with metrics when running Gobblin under YARN:
-
-
- 1. Reporter Leak - SEVERE
-
Metric reporters are constantly added to the system and never removed.
On the app_master, when the Quartz timer fires for a job, `AbstractJobLauncher.launchJob(JobListener)` is called. Right away, this calls `GobblinMetrics.startMetricReporting(Properties)`. This creates new file, jmx, kafka, and custom reporters. When these reporters are created, they are added to the collection of reporters stored in the `RootMetricContext`. Later, when the job finishes, it calls `JobMetrics.stopMetricsReporting()`. This ends up calling `RootMetricContext.stopReporting()`, which stops, but *does not remove*, all reporters. Additionally, `GobblinMetrics.metricsReportingStarted` is set to `false`. This means that a subsequent call to `GobblinMetrics.startMetricReporting(Properties)` will create a new set of reporters.
The net result is that the reporter collection in `RootMetricContext` grows without bound, which creates an excessive number of threads and, depending on the reporter, can cause the app_master to crash.
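
The leak pattern described above can be modelled in a few lines. This is a minimal sketch, not Gobblin code: `SketchReporter`, `SketchReporterRegistry`, and the `removeOnStop` flag are hypothetical names invented for illustration. It only shows why stopping without removing makes the collection grow by one entry per job run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a metric reporter; not a Gobblin class.
class SketchReporter {
    private boolean running;

    void start() { running = true; }

    void stop() { running = false; }

    boolean isRunning() { return running; }
}

// Hypothetical stand-in for the reporter collection held by the root context.
class SketchReporterRegistry {
    private final List<SketchReporter> reporters = new ArrayList<>();
    private final boolean removeOnStop;

    SketchReporterRegistry(boolean removeOnStop) {
        this.removeOnStop = removeOnStop;
    }

    // Models job launch: a fresh reporter is registered, then every
    // registered reporter (old and new) is started.
    void startReporting() {
        reporters.add(new SketchReporter());
        for (SketchReporter reporter : reporters) {
            reporter.start();
        }
    }

    // Models job completion: reporters are stopped; they are only
    // dropped from the collection when removeOnStop is true.
    void stopReporting() {
        for (SketchReporter reporter : reporters) {
            reporter.stop();
        }
        if (removeOnStop) {
            reporters.clear();
        }
    }

    int registeredCount() { return reporters.size(); }
}
```

With `removeOnStop` set to `false`, five launch/finish cycles leave five reporters registered; with it set to `true`, the collection is empty after each job.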
-
-
- 2. Metric Gaps - MODERATE
-
The way reporting is started and stopped can result in gaps in metric reporting.
On the app_master, when the Quartz timer fires for a job, `AbstractJobLauncher.launchJob(JobListener)` is called. Right away, this calls `GobblinMetrics.startMetricReporting(Properties)`. This calls `RootMetricContext.startReporting()`, which starts all associated reporters. Later, when the job is done, it calls `GobblinMetrics.stopMetricReporting()`, which calls `RootMetricContext.stopReporting()`, which in turn stops all associated reporters. This means that if the app_master is running two jobs, A & B, the following scenario will cause a metric gap for Job B:
| Time | Job | Action |
| --- | --- | --- |
| T | A | Start |
| T + 1 | B | Start |
| T + 2 | A | Finish |
| T + 3 | B | *No Metrics Reported* |
| T + N | B | Finish |
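
One possible way to avoid this gap is to count active jobs and stop reporting only when the last one finishes. The sketch below assumes that approach purely for illustration; `SketchSharedReporting` and its methods are hypothetical names, and this is not necessarily how the issue should be fixed in Gobblin.

```java
// Hypothetical reference-counted switch for shared metric reporting.
class SketchSharedReporting {
    private int activeJobs = 0;
    private boolean reporting = false;

    // The first job to start turns reporting on.
    synchronized void jobStarted() {
        if (activeJobs == 0) {
            reporting = true;
        }
        activeJobs++;
    }

    // Only the last job to finish turns reporting off.
    synchronized void jobFinished() {
        if (activeJobs == 0) {
            throw new IllegalStateException("jobFinished called with no active jobs");
        }
        activeJobs--;
        if (activeJobs == 0) {
            reporting = false;
        }
    }

    synchronized boolean isReporting() {
        return reporting;
    }
}
```

Replaying the scenario from the table, reporting stays on after Job A finishes at T + 2 because Job B is still counted as active, and it is switched off only when Job B finishes.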
-
-
- 3. Invalid Data - MILD
-
All file reporters are assigned all contexts, which will result in the wrong data in the files.
File metric reporters write to files with filenames which are specific to the job. Unfortunately, the list of contexts to which the reporters listen is not scoped to a given job. This means that files may contain metrics from many different jobs.
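
To illustrate the scoping problem, the sketch below shows a file-style reporter that only accepts contexts belonging to its own job. Everything here is an assumption made for illustration: `SketchContext`, `SketchJobScopedFileReporter`, and the idea that a context carries a job id are hypothetical and do not reflect Gobblin's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical metric context tagged with the job it belongs to.
class SketchContext {
    final String jobId;
    final String name;

    SketchContext(String jobId, String name) {
        this.jobId = jobId;
        this.name = name;
    }
}

// Hypothetical reporter that writes only the contexts of a single job.
class SketchJobScopedFileReporter {
    private final String jobId;
    private final List<String> writtenLines = new ArrayList<>();

    SketchJobScopedFileReporter(String jobId) {
        this.jobId = jobId;
    }

    // Receives every known context but keeps only those of its own job,
    // so metrics from other jobs never reach this job's file.
    void report(List<SketchContext> allContexts) {
        for (SketchContext context : allContexts) {
            if (jobId.equals(context.jobId)) {
                writtenLines.add(context.name);
            }
        }
    }

    List<String> writtenLines() {
        return writtenLines;
    }
}
```

Without the `jobId` check in `report`, the reporter for one job would also record the contexts of every other job, which is the behaviour described above.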
Github Url : https://github.com/linkedin/gobblin/issues/792
Github Reporter : jbaranick
Github Assignee : stakiar
Github Created At : 2016-03-03T18:19:18Z
Github Updated At : 2016-03-14T17:01:45Z
Comments
stakiar wrote on 2016-03-03T20:02:29Z : @kadaan thanks a ton for reporting and writing all this up. From what I can tell these all seem to be pretty serious issues.
It seems there may be a disconnect between the job execution model on YARN and how gobblin-metrics starts, stops, and manages metrics.
I will bring this up with the team today and see if we can create a plan of action.
Github Url : https://github.com/linkedin/gobblin/issues/792#issuecomment-191939044
liyinan926 wrote on 2016-03-03T20:06:30Z : Thanks, @sahilTakiar for calling this out. Yes, I agree that the Yarn execution model and even the MR execution model have a disconnect with gobblin-metrics, which is really designed and implemented for single-JVM usage because it is based on the Dropwizard metrics lib.
Github Url : https://github.com/linkedin/gobblin/issues/792#issuecomment-191940470
stakiar wrote on 2016-03-07T17:40:13Z : I am actively working on this. The high-level design is to introduce an `ApplicationLauncher` that can `start` and `stop` an application. A `BaseApplicationMaster` will control starting and stopping of metrics reporting as well as a set of core `Service`s (`JMXReporter`, etc.) and pluggable `Service`s (`JobScheduler`, `YarnService`).
Github Url : https://github.com/linkedin/gobblin/issues/792#issuecomment-193364481
jbaranick wrote on 2016-03-07T17:46:16Z : @sahilTakiar Great! That is kinda how I was thinking it should work. BTW, at least one of the reporters is job specific, FileReporter. The reason it is job specific is because it writes to a different file per job. To me this seems to indicate that there are either:
1. Two different kinds of reporters, global and local. Globals can be added at application startup, while local can be added at job startup.
2. The file reporter is global and keeps track of all open files. The file names are determined by context.
How are you thinking about this problem?
How are you taking care of item 3?
Github Url : https://github.com/linkedin/gobblin/issues/792#issuecomment-193366501
stakiar wrote on 2016-03-07T19:31:57Z : Thanks @kadaan! I need to think about item number 3 a little more. I think the work will be orthogonal to the changes for introducing the `ApplicationLauncher` so I am going to address them in different PRs.
Yes, the notes you listed sound reasonable. I will circle back to this issue once I have a better idea of what to do.
Github Url : https://github.com/linkedin/gobblin/issues/792#issuecomment-193412937
stakiar wrote on 2016-03-14T17:01:44Z : Issues 1 and 2 are resolved, issue 3 is still open.
Github Url : https://github.com/linkedin/gobblin/issues/792#issuecomment-196412229