Kudu / KUDU-3025

Add metric for the open file descriptors usage vs the limit

Details

    Description

      In the case of even replica distribution across all available nodes, once one tablet server hits the maximum number of open file descriptors and goes down (e.g., upon hosting another tablet replica), the system automatically re-replicates that server's tablet replicas onto the remaining tablet servers. Because those servers are likely close to the same limit, the extra replicas will most likely bring them down as well. That's a cascading failure scenario that nobody wants to experience.

      Monitoring the number of open file descriptors against the limit can help prevent a full Kudu cluster outage in such a case, provided operators are given a chance to handle the situation proactively. Once usage reaches some threshold (e.g., 90% of the limit), an operator could raise the limit via the corresponding ulimit setting, preventing an outage.
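      As an illustration of the proposed check (not Kudu's actual metric code), here is a minimal Python sketch that compares a process's open file descriptor count with its soft limit. It assumes a Linux host where /proc/self/fd lists the open descriptors. The function names and the 90% default are hypothetical and only mirror the example threshold above.

```python
import os
import resource


def open_fd_usage(fd_dir="/proc/self/fd"):
    """Return (open_fds, soft_limit) for the current process.

    Assumes a Linux-style /proc filesystem. Listing the directory
    temporarily opens one descriptor itself, so the count can be
    off by one. soft_limit is None when the limit is unlimited.
    """
    open_fds = len(os.listdir(fd_dir))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return open_fds, None
    return open_fds, soft_limit


def over_threshold(open_fds, limit, threshold_pct=90):
    """True once open_fds reaches threshold_pct percent of limit.

    Integer arithmetic keeps the comparison exact. An unlimited
    (None) limit never triggers the alert.
    """
    if limit is None:
        return False
    return open_fds * 100 >= threshold_pct * limit


if __name__ == "__main__":
    fds, limit = open_fd_usage()
    print(f"open fds: {fds}, soft limit: {limit}")
    if over_threshold(fds, limit):
        print("WARNING: open file descriptors above 90% of the limit")
```

      A monitoring job could run a check like this periodically for each tablet server process and alert before the limit is hit, leaving time to raise the ulimit.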

            People

              Assignee: Unassigned
              Reporter: Alexey Serbin (aserbin)