  Apache Ozone / HDDS-11339

Publishing hadoop metrics immediately in Prometheus sink fills up SinkQueue quickly

Details

    Description

      Two issues:

      1. When Prometheus support is enabled, PrometheusServlet is registered with BaseHttpServer. The servlet is called on every scrape (every 15 seconds by default), and each call publishes the Hadoop metrics immediately. On a very busy cluster with a large number of metrics to publish, this fills up the SinkQueue quickly; the sink can no longer consume the metrics it is given, and they are dropped outright.
      2. Apart from the dropped metrics, publishing takes the object lock of the MetricsSystemImpl class, and other threads keep waiting for that lock until the metrics have actually been published. In a recent incident on a busy cluster, ~190 threads were BLOCKED just waiting to acquire the lock of the MetricsSystemImpl class. This made the Recon role unresponsive, and after some time the JVM could not allocate sufficient memory and crashed with an OOM. The OOM is not specific to Recon: it can happen with any role that uses the Prometheus service on a busy cluster.
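
      The dropping behaviour in issue 1 can be illustrated with a small, self-contained model. This is only a sketch under assumptions: BoundedSinkQueueSketch is a hypothetical class built on java.util.concurrent.ArrayBlockingQueue, not Hadoop's actual SinkQueue, and the capacity and snapshot names are made up for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified model only: NOT Hadoop's SinkQueue, just a bounded queue
// that rejects (drops) new entries once it is full.
class BoundedSinkQueueSketch {
    private final BlockingQueue<String> queue;
    private int dropped = 0;

    BoundedSinkQueueSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called on every publish; returns false and counts a drop when full.
    boolean enqueue(String snapshot) {
        boolean accepted = queue.offer(snapshot); // non-blocking insert
        if (!accepted) {
            dropped++;
        }
        return accepted;
    }

    // Called by the sink consumer; returns null when nothing is queued.
    String consumeOne() {
        return queue.poll();
    }

    int dropped() {
        return dropped;
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedSinkQueueSketch q = new BoundedSinkQueueSketch(3);
        // Five publishes arrive while the slow sink consumes only one.
        for (int i = 0; i < 5; i++) {
            q.enqueue("snapshot-" + i);
            if (i == 1) {
                q.consumeOne();
            }
        }
        System.out.println("queued=" + q.size() + " dropped=" + q.dropped());
        // prints "queued=3 dropped=1"
    }
}
```

      Every extra publish triggered by a scrape adds an entry on top of the periodic ones, so a consumer that keeps pace with the timer alone can fall behind once scrapes also enqueue.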

      Solution: We should not publish the metrics immediately by calling

      DefaultMetricsSystem.instance().publishMetricsNow();

      because the Prometheus sink already has a mechanism to publish metrics every 10 seconds by default, using a callback on a timer event. The call above should therefore be removed.
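
      A minimal sketch of the intended behaviour, assuming the sink keeps the most recently published values in memory: the scrape path only reads that cache and never triggers a publish. CachedPrometheusSinkSketch and ScrapeHandlerSketch are hypothetical stand-ins written for illustration; they are not the Ozone or Hadoop classes, and the metric name is made up.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a Prometheus sink: the metrics system's
// periodic timer callback calls putMetric(); scrapes only read the cache.
class CachedPrometheusSinkSketch {
    private final Map<String, Double> latest = new ConcurrentHashMap<>();

    // Invoked by the periodic publish, not by the scrape.
    void putMetric(String name, double value) {
        latest.put(name, value);
    }

    // Writes the cached values as "name value" lines, sorted by name.
    void writeMetrics(Writer out) throws IOException {
        for (Map.Entry<String, Double> e : new TreeMap<>(latest).entrySet()) {
            out.write(e.getKey() + " " + e.getValue() + "\n");
        }
    }
}

// Hypothetical scrape handler standing in for the servlet's GET path.
class ScrapeHandlerSketch {
    private final CachedPrometheusSinkSketch sink;

    ScrapeHandlerSketch(CachedPrometheusSinkSketch sink) {
        this.sink = sink;
    }

    String handleScrape() throws IOException {
        // Deliberately no "publish now" call here: the scrape takes no
        // metrics-system lock and adds nothing to the sink queue.
        StringWriter w = new StringWriter();
        sink.writeMetrics(w);
        return w.toString();
    }

    public static void main(String[] args) throws IOException {
        CachedPrometheusSinkSketch sink = new CachedPrometheusSinkSketch();
        sink.putMetric("sink_queue_size", 5.0); // as if set by the timer
        System.out.print(new ScrapeHandlerSketch(sink).handleScrape());
    }
}
```

      The trade-off in this sketch is freshness: a scrape may return values that are up to one publish period old, in exchange for scrapes no longer competing for the metrics system lock.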

            People

              deveshsingh Devesh Kumar Singh