Thanks Andrew. I just reverted the change.
Why not present a histogram rather than a single threshold like this? That way we don't add a new config, present more info, and don't require a restart to change this threshold.
In our case we are mostly interested in the 95th percentile, because it serves as an alarm that 5% of DNs are becoming hot nodes and will likely cause job failures. A histogram is a nice idea, actually. We can think about an appropriate granularity for it (e.g. every 5%?). The only drawback is that it will add more content to the NN web UI and make it busier; I imagine it will be a table.
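To make the granularity question concrete, here is a minimal sketch of how per-DN usage could be bucketed at 5% steps. It is not taken from the HDFS code base: the function name `usage_histogram`, its parameters, and the choice to put a node at exactly 100% into the last bucket are all assumptions for illustration.

```python
from typing import List


def usage_histogram(usages: List[float], bucket_width: int = 5) -> List[int]:
    """Count DataNodes per utilization bucket.

    Hypothetical helper: `usages` holds per-DN usage as percentages in
    [0, 100]; bucket i covers [i * bucket_width, (i + 1) * bucket_width).
    """
    if bucket_width <= 0 or 100 % bucket_width != 0:
        raise ValueError("bucket_width must be a positive divisor of 100")
    num_buckets = 100 // bucket_width
    counts = [0] * num_buckets
    for usage in usages:
        if not 0.0 <= usage <= 100.0:
            raise ValueError(f"usage out of range: {usage}")
        # Assumption: a node at exactly 100% is counted in the last bucket.
        index = min(int(usage // bucket_width), num_buckets - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    counts = usage_histogram([0, 4.9, 5, 72.5, 100])
    for i, count in enumerate(counts):
        if count:
            print(f"{i * 5:3d}-{(i + 1) * 5:3d}%: {count}")
```

With 5% buckets the table would have 20 rows, which gives a rough idea of how much space it would take on the web UI.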
This is also a metric that could be calculated in client-side JS from existing information.
True, but I think showing it on the NN web UI is more convenient for admins. We proposed the change because the median (50th percentile) is actually a poor metric for illustrating the imbalance level, especially in a busy cluster with, say, > 70% overall utilization. We therefore wanted a "better median".
The config says it's a percentile, but it's really a quantile.
Good catch. We could change the config to be a real percentile, between 0 and 100. Per above, we could also show a histogram instead.
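For reference, a rough sketch of what a "real percentile" setting could look like, using the nearest-rank definition. The name `usage_percentile` and the accepted range of (0, 100] are assumptions for illustration, not the patch's actual code. It also shows why the median can hide hot nodes while the 95th percentile exposes them.

```python
import math
from typing import List


def usage_percentile(usages: List[float], percentile: float) -> float:
    """Nearest-rank percentile of per-DN usage (hypothetical helper).

    `percentile` is a true percentile in (0, 100], not a quantile in (0, 1].
    """
    if not usages:
        raise ValueError("usages must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(usages)
    # Multiply before dividing so that e.g. 95 * 20 / 100 is exactly 19.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # 18 DNs at 70% usage and 2 hot DNs at 95% usage.
    cluster = [70.0] * 18 + [95.0] * 2
    print(usage_percentile(cluster, 50))  # median: 70.0
    print(usage_percentile(cluster, 95))  # 95th percentile: 95.0
```

Note that a quantile such as 0.95 still falls inside (0, 100], so a range check alone would not catch an old-style value left in the config; that would need to be called out in the config description.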
So overall I like the histogram idea. Kai Sasaki, what are your thoughts?