[HDDS-11339] Publishing hadoop metrics immediately in Prometheus sink fills up SinkQueue quickly - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.0
Fix Version/s: 2.0.0
Component/s: Ozone Manager, Ozone Recon
Labels:
- pull-request-available

Description

Two issues:

1. When Prometheus support is enabled, PrometheusServlet is registered with BaseHttpServer. The servlet is called once per scrape interval (every 15 seconds by default), and each call publishes the Hadoop metrics immediately. On a very busy cluster with a large number of metrics to publish, the SinkQueue fills up quickly; the sink then cannot consume the metrics it is given and drops them outright.
2. Apart from the dropped metrics, publishing takes the object lock of the MetricsSystemImpl class, and other threads keep waiting for that lock until the metrics have actually been published. In a recent incident on a busy cluster, about 190 threads were BLOCKED just waiting to acquire the MetricsSystemImpl lock. This made the Recon role unresponsive, and after some time the JVM could not allocate sufficient memory and crashed with an OOM error. The OOM is not specific to Recon: it can happen in any role that uses the Prometheus service on a busy cluster.
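The dropping behaviour in the first issue can be illustrated with a minimal sketch. It is not Hadoop's SinkQueue; it assumes a generic bounded, non-blocking queue (`ArrayBlockingQueue` as a stand-in) that nothing drains while publishes keep arriving. The class name `SinkQueueDropSketch`, the method `countDropped`, and the capacity and publish counts are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class SinkQueueDropSketch {

    /**
     * Offers `publishes` snapshots to a bounded queue that no consumer drains
     * (a stand-in for a sink that cannot keep up) and returns how many were dropped.
     */
    public static int countDropped(int capacity, int publishes) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int dropped = 0;
        for (int i = 0; i < publishes; i++) {
            // offer() never blocks: it returns false when the queue is already full,
            // so the snapshot is simply lost.
            if (!queue.offer("snapshot-" + i)) {
                dropped++;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: room for 10 snapshots, 25 immediate publishes.
        System.out.println(countDropped(10, 25)); // prints 15
    }
}
```

Once the queue is full, every extra publish is lost, so publishing on each scrape in addition to the periodic publish only increases the number of dropped snapshots.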

Solution: There is no need to publish the metrics immediately by calling

DefaultMetricsSystem.instance().publishMetricsNow();

because the Prometheus sink already has a mechanism that publishes metrics every 10 seconds by default, using a callback driven by a timer event. The call above should therefore be removed so that scraping no longer triggers an immediate publish.
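The intended behaviour after the fix can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the pull request: the class `CachedScrapeSketch` and its methods `onTimer`, `scrape`, and `publishCount` are hypothetical stand-ins for the sink's periodic callback and the servlet's scrape handler.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class CachedScrapeSketch {
    private final AtomicInteger publishCount = new AtomicInteger();
    private final AtomicReference<String> lastSnapshot = new AtomicReference<>("");

    /** Stand-in for the periodic callback; in a real system a timer would invoke it. */
    public void onTimer() {
        int n = publishCount.incrementAndGet();
        lastSnapshot.set("snapshot " + n);
    }

    /** Stand-in for the scrape handler: returns the cached snapshot, triggers no publish. */
    public String scrape() {
        return lastSnapshot.get();
    }

    public int publishCount() {
        return publishCount.get();
    }

    public static void main(String[] args) {
        CachedScrapeSketch sketch = new CachedScrapeSketch();
        sketch.onTimer();              // one periodic publish
        for (int i = 0; i < 5; i++) {
            sketch.scrape();           // five scrapes, none of which publishes
        }
        System.out.println(sketch.publishCount() + " / " + sketch.scrape()); // prints 1 / snapshot 1
    }
}
```

In this sketch the number of publishes depends only on the timer, not on how often the endpoint is scraped, which is the property the fix is after.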

Issue Links

links to

GitHub Pull Request #7092

People

Assignee: Devesh Kumar Singh

Reporter: Devesh Kumar Singh

Votes: 0

Watchers: 1

Dates

Created: 19/Aug/24 10:21

Updated: 07/Sep/24 21:19

Resolved: 29/Aug/24 15:32