HBase / HBASE-28963

Updating Quota Factors is too expensive

Details

    • The horizontal scalability of the Quotas refresh chore was improved. A side effect of this change is that a Quotas cache miss no longer triggers an immediate refresh of the cache.

    Description

      My company is running Quotas across a few hundred clusters of varied size. One cluster has hundreds of servers and tens of thousands of regions. We noticed that the HMaster was quite busy for this cluster, and after some investigation we realized that RegionServers were hammering the HMaster's ClusterMetrics endpoint to facilitate the refreshing of table machine quota factors.

      There are a few things that we could do here. In a perfect world, I think the RegionServers would have better P2P communication of region states and whatever else is necessary to derive new quota factors. Relying solely on the HMaster for this coordination creates a tricky bottleneck for the horizontal scalability of clusters.

      That said, I think that a simpler and preferable initial step would be to make our code a bit more cost-conscious. At my company, for example, we don't even define any table-scoped quotas. Without any table-scoped quotas in the cache, the refresh could be much more selective about the work that it does on each pass. So I'm proposing that we check the size of the tableQuotaCache keyset earlier, and use the result to determine which ClusterMetrics we bother to fetch.
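
      The proposal above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual HBase patch: the `MetricOption` enum, the `optionsToFetch` helper, and the assumption that the server list is always needed while the per-table region counts are only needed for table-scoped quotas are all hypothetical stand-ins chosen for the example.

```java
import java.util.EnumSet;
import java.util.Set;

public class QuotaRefreshSketch {

  // Hypothetical stand-in for the kinds of cluster metrics a refresh could request.
  public enum MetricOption { SERVERS_NAME, TABLE_TO_REGIONS_COUNT }

  /**
   * Decide which metrics to request based on whether any table-scoped
   * quotas are cached. Assumption: the server list is always required,
   * while per-table region counts only matter for table-scoped quotas.
   */
  public static EnumSet<MetricOption> optionsToFetch(Set<String> tableQuotaCacheKeys) {
    EnumSet<MetricOption> options = EnumSet.of(MetricOption.SERVERS_NAME);
    if (!tableQuotaCacheKeys.isEmpty()) {
      // Only pay for the more expensive per-table data when it will be used.
      options.add(MetricOption.TABLE_TO_REGIONS_COUNT);
    }
    return options;
  }

  public static void main(String[] args) {
    // No table-scoped quotas cached: request the cheaper subset only.
    System.out.println(optionsToFetch(Set.of()));
    // At least one table-scoped quota cached: request the per-table data too.
    System.out.println(optionsToFetch(Set.of("ns:orders")));
  }
}
```

      The point of the sketch is that the emptiness check happens before any request is built, so a cluster with no table-scoped quotas never asks the HMaster for the per-table data at all.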

      Attachments

        1. image-2024-11-06-12-06-44-317.png
          435 kB
          Ray Mattingly
        2. quota-refresh-hmaster.png
          435 kB
          Ray Mattingly

          People

            rmdmattingly Ray Mattingly
            rmdmattingly Ray Mattingly
            Votes: 0
            Watchers: 3
