
      Description

      Hello,

      In Cassandra production environments, users sometimes build anti-pattern tables and issue queries in inefficient ways.
      So I would like to suggest a feature that enables logging of slow queries.
      The feature can help Cassandra operators identify bad query patterns.
      Operators can then advise users who are less familiar with Cassandra on their queries and data models.

      This ticket is related to CASSANDRA-6226, but here I focus on detecting bad query patterns, not aborting them.

      1. sample.log (0.9 kB) - Shogo Hoshii
      2. slow_query.patch (26 kB) - Shogo Hoshii
      3. slow_query.patch (6 kB) - Shogo Hoshii


          Activity

          Shogo Hoshii added a comment -

          This is a sample result with the following settings:

          • enable_logging_slow_select_query: true
          • logging_slow_select_query_threshold_in_ms: 10

          Shogo Hoshii added a comment -

          I have attached the patch. I hope someone can review it!

          Alex Petrov added a comment -

          The Java Driver has a QueryLogger.SLOW that might be helpful in such circumstances. The big advantage of having it on the client side is that you can also catch queries that are slow because of the client itself or because of network latency between the client and the server.
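
          For illustration, a minimal client-side sketch (assuming the DataStax Java driver 3.x API; the threshold value here is arbitrary):

              import com.datastax.driver.core.Cluster;
              import com.datastax.driver.core.QueryLogger;

              public class ClientSideSlowLog
              {
                  public static void main(String[] args)
                  {
                      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

                      // Queries slower than the threshold are logged under the
                      // com.datastax.driver.core.QueryLogger.SLOW logger category,
                      // which must be enabled at DEBUG level in the logging config.
                      QueryLogger queryLogger = QueryLogger.builder()
                                                           .withConstantThreshold(500) // milliseconds
                                                           .build();
                      cluster.register(queryLogger);
                  }
              }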

          Shogo Hoshii added a comment -

          Hello Alex Petrov,

          I appreciate your advice.
          It is very useful in cases where the Cassandra operator can access logs on the client servers.
          But I cannot do that: in my company, my team (the operators) serves many users from different departments, and the operators do not have access to the client servers.
          It is also difficult for my team to require all users to use a specific driver, because the organisations are separate.
          (I guess some other Cassandra operators are in the same circumstances as my team.)
          So I would say we need to log slow queries on the server side.

          Stefania added a comment -

          Thank you for the patch. I think this is definitely something that operators will find very useful.

          CASSANDRA-7392 introduced a poor man's "slow query log" for queries that timed out; the code is in MonitoringTask in the db.monitoring package. This is done replica side rather than coordinator side.

          We may want to build on that rather than log slow queries coordinator side. If we log coordinator side, it may be that some queries took a long time only because a replica was overloaded. Replica side, we can instead measure the query time accurately. Each read command is already monitored, so we basically just need to add a new state to Monitorable, and notify MonitoringTask when this state is triggered.

          I would continue logging in aggregate and at debug level, as is currently done for queries that timed out. In MonitoringTask, we would basically first log slow queries, if any, then queries that timed out, if any. We can do this with two separate queues or by generalizing (and renaming) FailedOperation.
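
          To make the idea concrete, a rough sketch (hypothetical names; the real interfaces are Monitorable and MonitoringTask in org.apache.cassandra.db.monitoring):

              import java.util.Queue;
              import java.util.concurrent.ConcurrentLinkedQueue;

              // Hypothetical sketch only, not the actual Cassandra implementation.
              final class MonitoringTaskSketch
              {
                  static final class SlowOperation
                  {
                      final String name;      // e.g. the CQL of the monitored read command
                      final long durationMs;  // time measured replica side

                      SlowOperation(String name, long durationMs)
                      {
                          this.name = name;
                          this.durationMs = durationMs;
                      }
                  }

                  private final Queue<SlowOperation> slowOperations = new ConcurrentLinkedQueue<>();
                  private final long slowTimeoutMs;

                  MonitoringTaskSketch(long slowTimeoutMs)
                  {
                      this.slowTimeoutMs = slowTimeoutMs;
                  }

                  // Called by a Monitorable implementation when an operation crosses
                  // the slow threshold (the new state suggested above).
                  void onSlowOperation(String name, long durationMs)
                  {
                      if (durationMs >= slowTimeoutMs)
                          slowOperations.add(new SlowOperation(name, durationMs));
                  }

                  // Invoked periodically by a scheduled task: drain the queue and log
                  // the slow operations in aggregate (at debug level in the real code).
                  void logSlowOperations()
                  {
                      SlowOperation op;
                      while ((op = slowOperations.poll()) != null)
                          System.out.printf("<%s>, time %d msec - slow timeout %d msec%n",
                                            op.name, op.durationMs, slowTimeoutMs);
                  }
              }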

          Shogo Hoshii added a comment -

          Hello Stefania,

          Thank you very much for the quick and very detailed comment.
          I will look into `Monitorable` and `MonitoringTask`.

          Alex Petrov added a comment -

          Sure, I was just trying to be helpful, as under some circumstances that would be at least part of a possible solution. I understand that this can be helpful for the operators.

          Stefania added a comment -

          Sure, if you have any questions, don't hesitate to ask.

          Shogo Hoshii added a comment -

          Modified the patch to use MonitoringTask to log slow queries.

          Shogo Hoshii added a comment -

          Hello Stefania,

          I have modified the code according to your advice.
          I would be happy if you could review it and give me some comments.

          Thank you.

          Stefania added a comment -

          Hello Hoshii-san

          Thank you for the patch. The approach is correct; you understood exactly what I meant in my previous comment.

          I created a branch with the initial commit and some suggestions in a follow-up commit. I've also started the tests; the results are still pending:

          trunk patch testall dtest

          The suggestions can be summarized as follows:

          • logging_slow_query_threshold_in_ms in cassandra.yaml was renamed to query_log_timeout_in_ms and the default value increased from 200 to 500 milliseconds; I know this is very high, but I prefer to be on the safe side, WDYT?
          • The methods associated with the retrieval of the configuration property have been renamed for consistency with the new yaml property name.
          • With the initial version of the patch, the same query could have been added to the slow queries queue several times. I've changed MonitorableImpl so that we log a slow query only when it has completed. This has the advantage that we have the accurate total time for the query, and that we do not log queries that time out twice.
          • Given the change above, it was then possible to refactor MonitoringTask further, which I did. There was also a problem in that the slow queries queue was bounded, but we did not report in the log message that some slow query logs were missing due to a full queue (via "..." in the message); the refactoring should have fixed this. I've also added some comments (unrelated to this patch; they should already have been there).
          • I think we can log both slow and timed-out queries with the same scheduled task.
          • The unit tests have been changed to reflect the points above.
          • In MonitoringTask the summary log message for slow queries was downgraded from WARN to INFO, and the logger was wrapped with NoSpamLogger, which will log the summary message at most once every 5 minutes; see the sketch after this list. NoSpamLogger is also used for the summary log message for timed-out queries, although this is still at WARN level. However, detailed logs will still be written every time at DEBUG level in debug.log.
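
          A rough sketch of the NoSpamLogger wrapping mentioned in the last point (the exact call sites in MonitoringTask may differ; NoSpamLogger is in org.apache.cassandra.utils):

              import java.util.concurrent.TimeUnit;

              import org.slf4j.Logger;
              import org.slf4j.LoggerFactory;

              import org.apache.cassandra.utils.NoSpamLogger;

              class SlowQuerySummarySketch
              {
                  private static final Logger logger = LoggerFactory.getLogger(SlowQuerySummarySketch.class);

                  // Emits the wrapped statement at most once every 5 minutes,
                  // no matter how often the scheduled task fires.
                  private static final NoSpamLogger noSpamLogger = NoSpamLogger.getLogger(logger, 5L, TimeUnit.MINUTES);

                  void logSummary(String details)
                  {
                      noSpamLogger.info("Some operations were slow, details available at debug level (debug.log)");

                      // The detailed per-query information still goes to debug.log every time.
                      logger.debug(details);
                  }
              }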

          Finally, I've created some distributed tests here. You can run them to see what the log messages will look like. Here is a sample:

          INFO  [ScheduledTasks:1] 2016-08-10 16:56:50,209 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
          DEBUG [ScheduledTasks:1] 2016-08-10 16:56:50,211 MonitoringTask.java:173 - 3 operations were slow in the last 4998 msecs:
          <SELECT * FROM ks.test2 LIMIT 5000>, time 3026 msec - slow timeout 30 msec
          <SELECT * FROM ks.test2 WHERE id = 1 LIMIT 5000>, was slow 2 times: avg/min/max 330/325/335 msec - slow timeout 30 msec
          <SELECT * FROM ks.test2 WHERE token(id) > 0 LIMIT 5000>, time 1449 msec - slow timeout 30 msec
          

          The INFO message appears in system.log and debug.log, but not more than once every 5 minutes. The DEBUG log appears every 5 seconds (configurable) if there is at least one slow query.

          Could you review the follow-up commit and let us know what you think? Please do not hesitate to express any concerns, especially regarding the logging choices, since they are intended for operators.

          Shogo Hoshii added a comment -

          Hello Stefania-san,

          Thank you very much for the review and for adding the dtest! Great job.

          The default value of query_log_timeout_in_ms is OK for me, because most problematic queries take more than 1,000 msec in my company's production environment. If my team has clusters in which we have to be very sensitive to slow queries, we can simply change the configuration.
          I do not have any concerns about the 5-minute logging interval. In terms of log level, monitoring tools can still detect the slow query log by grepping for its message, even though it is at INFO level.

          I am sorry for introducing the bugs: the duplicated logging and the missing logs. I should have read the code more thoroughly.
          But I have learnt a lot from your comments, code and tests.
          I am looking forward to seeing the patch merged!

          Thank you.

          Stefania added a comment - edited

          Thank you for confirming the suggested implementation.

          There were some failures in the tests, mostly unrelated, except for one failure in the new tests. I've fixed the failing test and launched a repeated execution (150 times) here. I've also launched the full tests a second time to shake off some of the unrelated failures.

          Before committing the patch, I would also like to have one more person look at it, cc Joshua McKenzie.

          Stefania added a comment -

          Launched another multiplexed run here, as the previous run had problems.

          Stefania added a comment -

          The multiplex run above failed due to some test infrastructure problems. Relaunched here.

          Stefania added a comment -

          There were 2 failures out of 150 runs in the multiplex job above; I've relaxed the slow query tests and relaunched here.

          Stefania added a comment -

          The latest multiplexed run was successful.

          Tyler Hobbs added a comment -

          Great work on this, Shogo Hoshii and Stefania. I've reviewed the patch and the dtest, and everything looks good to me, so +1. I do recommend adding something to NEWS.txt pointing out the new config option when you commit, though.

          Stefania added a comment -

          Thank you for the review, Tyler Hobbs, and for the initial patch, Shogo Hoshii. Committed to trunk as 6966fcda9b5ce657760dfe103e5812862306ff7a with the following text added to NEWS.txt:

               A slow query log has been added: slow queries will be logged at DEBUG level.
               For more details refer to CASSANDRA-12403 and slow_query_log_timeout_in_ms
               in cassandra.yaml.
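
          For reference, the resulting option as it might appear in cassandra.yaml (a sketch; the name comes from the NEWS.txt text above and the 500 ms default from the earlier discussion):

               # Log a query at DEBUG level if it takes longer than this many milliseconds.
               slow_query_log_timeout_in_ms: 500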
          

          Pull request for the new tests created here.

          Shogo Hoshii added a comment -

          Stefania, Tyler Hobbs

          Thank you so much!


            People

            • Assignee: Shogo Hoshii
            • Reporter: Shogo Hoshii
            • Reviewer: Stefania
            • Votes: 0
            • Watchers: 12
