Thank you for the patch. The approach is correct, you understood exactly what I meant in my previous comment.
I created a branch with the initial commit and some suggestions in a follow-up commit. I've also started the tests, the results are still pending:
The suggestions can be summarized as follows:
- logging_slow_query_threshold_in_ms in cassandra.yaml was renamed to query_log_timeout_in_ms and the default value increased from 200 to 500 milliseconds; I know this is very high but I prefer to be on the safe side, WDYT?
- The methods associated with the retrieval of the configuration property have been renamed for consistency with the new yaml property name.
- With the initial version of the patch, the same query would have been potentially added to the slow queries queue several times, I've changed MonitorableImpl so that we log a slow query only when it is completed. This has the advantage that we have the accurate total time for the query, and that we do not log queries that time out twice.
- Given the change above, it was then possible to refactor further MonitorTask, which I did. There was also a problem in that the slow queries queue was bounded but we did not report in the log message that some slow query logs were missing due to a full queue (via ... in the message), the refactoring should have fixed this. I've also added some comments (unrelated to this patch, they should already have been there).
- I think we can log both slow and timed out queries with the same scheduling task.
- The unit tests have been changed to reflect the points above.
- In MonitorTask the summary log message for slow queries was downgraded from WARN to INFO, and the logger wrapped with NoSpamLogger, which will log the summary message at most once every 5 minutes. NoSpamLogger is also used for the summary log message for timed-out queries, although this is still at WARN level. However, detailed logs will still be logged every time at DEBUG level in debug.log.
Finally, I've created some distributed tests here. You can run them to see what the log messages will look like. Here is a sample:
INFO [ScheduledTasks:1] 2016-08-10 16:56:50,209 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2016-08-10 16:56:50,211 MonitoringTask.java:173 - 3 operations were slow in the last 4998 msecs:
<SELECT * FROM ks.test2 LIMIT 5000>, time 3026 msec - slow timeout 30 msec
<SELECT * FROM ks.test2 WHERE id = 1 LIMIT 5000>, was slow 2 times: avg/min/max 330/325/335 msec - slow timeout 30 msec
<SELECT * FROM ks.test2 WHERE token(id) > 0 LIMIT 5000>, time 1449 msec - slow timeout 30 msec
The INFO message appears in system.log and debug.log but not more than once every 5 minutes. The DEBUG log appears every 5 seconds (configurable) if there is at least a slow query.
Could you review the follow-up commit and let us know what you think? Please do not hesitate to express any concerns, especially regarding the logging choices, since they are intended for operators.