When users ask why their workloads are slower than expected, the same questions tend to come up. It'd be great if we had a single tool (or a single web page) that aggregated and displayed useful information for a specific tablet or table. For example, for a specific table:
- How many partitions and replicas exist for the table.
- For those replicas, how they are distributed across tablet servers.
- For those tablet servers, what the block cache configuration is, and what the current block cache stats (hit ratio, evictions, etc.) are.
- For those tablet servers, which tablets have been written to recently.
- For those tablet servers, which tablets within the target table have been written to recently.
- For those tablet servers, how many active and non-expired scanners exist.
- For those tablet servers, which tablets within the target table have been read from recently.
- For those tablet servers, how many tablet copies are in progress, both to and from the server.
- For those tablet servers, how many data directories there are.
- For the data directories on those tablet servers, how many replicas store data in each directory, how many blocks each directory holds, and how much space is available in each.
The list could go on and on. It probably makes sense to break the diagnostics into different phases or goals, maybe along the lines of 1) identifying workload hotspots and lag across tablet servers (e.g. a ton of writes going to a single tserver), and 2) digging into a single tablet server to understand how it's provisioned and whether that provisioning is sufficient.
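As a rough illustration of the first phase (spotting a tserver that takes a disproportionate share of writes), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ReplicaSample` record, the `recent_writes` metric, the function names, and the "more than 2x the mean" threshold are all hypothetical and are not part of any existing Kudu tool or API. A real tool would need to populate these samples from whatever metrics the tablet servers actually expose.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaSample:
    """Hypothetical per-replica sample; not a Kudu API type."""
    tablet_id: str
    tserver: str
    recent_writes: int  # writes seen in the sampling window (assumed metric)


def summarize_by_tserver(samples):
    """Group samples by tserver, counting replicas and summing recent writes."""
    summary = defaultdict(lambda: {"replicas": 0, "recent_writes": 0})
    for s in samples:
        entry = summary[s.tserver]
        entry["replicas"] += 1
        entry["recent_writes"] += s.recent_writes
    return dict(summary)


def find_write_hotspots(summary, factor=2.0):
    """Return the sorted tservers whose recent writes exceed factor * mean."""
    if not summary:
        return []
    mean = sum(e["recent_writes"] for e in summary.values()) / len(summary)
    if mean == 0:
        return []
    return sorted(
        ts for ts, e in summary.items() if e["recent_writes"] > factor * mean
    )


if __name__ == "__main__":
    # Made-up sample data for one table spread across three tservers.
    samples = [
        ReplicaSample("tablet-1", "ts-a", 900),
        ReplicaSample("tablet-2", "ts-a", 50),
        ReplicaSample("tablet-1", "ts-b", 30),
        ReplicaSample("tablet-3", "ts-c", 20),
    ]
    summary = summarize_by_tserver(samples)
    for tserver, stats in sorted(summary.items()):
        print(tserver, stats)
    print("write hotspots:", find_write_hotspots(summary))
```

The mean-based threshold is only one possible heuristic. With very few tablet servers, or with a naturally skewed partitioning scheme, a percentile or an absolute rate limit might be a better fit.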