Details
Task
Status: Resolved
Major
Resolution: Fixed
1.4.0
Description
Two issues:
- PrometheusServlet is registered with BaseHttpServer when Prometheus support is enabled. It is called at every scrape interval (15 seconds by default), and each call publishes the Hadoop metrics immediately. In a very busy cluster with a large number of metrics to publish, the SinkQueue fills up quickly; the sink cannot consume the metrics fast enough, and they are dropped outright.
- Apart from the dropped metrics, publishing takes the object lock of the MetricsSystemImpl class, and other threads keep waiting for that lock until the metrics have actually been published. A recent incident highlighted this: in a busy cluster, ~190 threads were BLOCKED just waiting to acquire the lock of the MetricsSystemImpl class. This made the Recon role unresponsive, and after some time the JVM could not allocate sufficient memory and crashed with an OOM. The OOM is not specific to Recon; it can happen to any role that uses the Prometheus service in a busy cluster.
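The dropping behavior in the first issue can be illustrated with a minimal sketch. It assumes a generic bounded queue as a stand-in; the class `BoundedSinkQueueSketch` and its methods are hypothetical and are not Hadoop's SinkQueue implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for a bounded sink queue; not Hadoop's SinkQueue. */
class BoundedSinkQueueSketch {
    private final BlockingQueue<String> queue;
    private int dropped;

    BoundedSinkQueueSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called on every forced publish; the snapshot is dropped if the queue is full. */
    boolean enqueue(String snapshot) {
        boolean accepted = queue.offer(snapshot); // non-blocking; returns false when full
        if (!accepted) {
            dropped++;
        }
        return accepted;
    }

    int dropped() {
        return dropped;
    }

    int size() {
        return queue.size();
    }
}
```

When snapshots are enqueued faster than a consumer drains them, every `enqueue` call beyond the capacity returns `false` and the snapshot is lost, which mirrors the behavior described above.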
Solution: There is no need to publish the metrics immediately by calling
DefaultMetricsSystem.instance().publishMetricsNow();
because the Prometheus sink already has a mechanism to publish metrics every 10 seconds by default, using a callback driven by a timer event. So the call above, which publishes immediately, should be removed.
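The idea behind the fix can be sketched as follows. This is an illustration under assumptions, not the actual patch: `CachedScrapeSketch`, `onPeriodicPublish`, and `scrape` are hypothetical names, and the periodic timer is simulated by calling `onPeriodicPublish` directly.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: scrapes read a cached snapshot instead of forcing a publish. */
class CachedScrapeSketch {
    private final AtomicReference<String> lastSnapshot = new AtomicReference<>("");
    private final AtomicInteger publishCount = new AtomicInteger();

    /** Invoked by the periodic timer callback, never by a scrape request. */
    void onPeriodicPublish(String snapshot) {
        publishCount.incrementAndGet();
        lastSnapshot.set(snapshot);
    }

    /** Invoked on each scrape; takes no lock and triggers no publish. */
    String scrape() {
        return lastSnapshot.get();
    }

    int publishCount() {
        return publishCount.get();
    }
}
```

With this shape, the number of publishes depends only on the timer period, not on how often the endpoint is scraped, so scrape requests no longer contend for the metrics system lock.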