Currently, a few Processes have one-off event queue size metrics computed using PullGauges. This approach has several known disadvantages:
- Getting event queue size metrics for a Process requires changing code and recompiling.
- Because the pull gauge dispatches onto the Process, it slows down metrics responses. It also reports the queue size only after the queue has been flushed of all messages that arrived before the pull gauge dispatch (see MESOS-8914).
- The use of a single "size" metric means that one cannot observe the overall enqueue and dequeue throughput.
These one-off gauges can be replaced by introducing first-class support in libprocess for event queue metrics. For queue size / throughput, we can take the following approach:
- Use configuration to opt in to metrics for Processes of interest. E.g. specify "master,allocator" to enable metrics for those Processes.
- Expose a pair of counters for "enqueued" and "dequeued" messages. The user can calculate the size of the queue by subtracting "dequeued" from "enqueued". For better usability, we could also expose size as a pull gauge that either subtracts the two counters (prone to races, since the two reads are not taken atomically) or inspects the queue size directly without a trip through the queue.
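
The approach above could look roughly like the following. This is a minimal sketch in plain standard C++, not libprocess code: `parseEnabled`, `MeteredQueue` and all of its members are hypothetical names chosen for illustration, and events are reduced to `int` for brevity. It shows the opt-in list being parsed, the two counters being bumped on enqueue / dequeue, and the two ways of obtaining the size.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: turns an opt-in value such as "master,allocator"
// into the set of Process names that should have queue metrics enabled.
std::set<std::string> parseEnabled(const std::string& value)
{
  std::set<std::string> names;
  std::istringstream in(value);
  std::string name;
  while (std::getline(in, name, ',')) {
    if (!name.empty()) {
      names.insert(name);
    }
  }
  return names;
}

// Hypothetical stand-in for a Process event queue (not the libprocess one).
class MeteredQueue
{
public:
  void enqueue(int event)
  {
    std::lock_guard<std::mutex> lock(mtx);
    events.push_back(event);
    ++enqueued;
  }

  // Returns false if the queue was empty.
  bool dequeue(int* event)
  {
    std::lock_guard<std::mutex> lock(mtx);
    if (events.empty()) {
      return false;
    }
    *event = events.front();
    events.pop_front();
    ++dequeued;
    return true;
  }

  // Monotonic counters; a rate over either one gives throughput.
  uint64_t enqueuedCount() const { return enqueued.load(); }
  uint64_t dequeuedCount() const { return dequeued.load(); }

  // Size derived from the counters. The two loads are not taken as one
  // atomic snapshot, so the result can be slightly off under concurrency.
  // "dequeued" is read first so that the difference does not go negative.
  uint64_t sizeFromCounters() const
  {
    const uint64_t d = dequeued.load();
    const uint64_t e = enqueued.load();
    return e - d;
  }

  // Size read directly from the container, without sending an event
  // through the queue.
  std::size_t sizeDirect() const
  {
    std::lock_guard<std::mutex> lock(mtx);
    return events.size();
  }

private:
  mutable std::mutex mtx;
  std::deque<int> events;
  std::atomic<uint64_t> enqueued{0};
  std::atomic<uint64_t> dequeued{0};
};
```

For example, after three `enqueue` calls and one successful `dequeue`, the counters read 3 and 1, and both `sizeFromCounters()` and `sizeDirect()` report 2. Neither size accessor dispatches onto the queue, so reading it does not wait behind pending events.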