Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 1
    • Component/s: Core
    • Labels: None

      Description

      In the spirit of CASSANDRA-511, it would be useful to have tracing on queries to see where latency is coming from: how long did the row cache lookup take? key search in the index? merging the data from the sstables? etc.

      The main difference vs setting debug logging is that debug logging is too big of a hammer; by turning on the flood of logging for everyone, you actually distort the information you're looking for. This would be something you could set per-query (or more likely per connection).

      We don't need to be as sophisticated as the techniques discussed in the following papers, but they are interesting reading:

      http://research.google.com/pubs/pub36356.html
      http://www.usenix.org/events/osdi04/tech/full_papers/barham/barham_html/
      http://www.usenix.org/event/nsdi07/tech/fonseca.html

      1. 1123-v9.patch
        170 kB
        David Alves
      2. 1123-v9.txt
        170 kB
        Jonathan Ellis
      3. 1123-v9.txt
        155 kB
        Jonathan Ellis
      4. 1123-v8.patch
        195 kB
        David Alves
      5. 1123-v7.patch
        179 kB
        David Alves
      6. 1123-v6.txt
        215 kB
        Jonathan Ellis
      7. 1123.patch
        232 kB
        David Alves
      8. 1123.patch
        228 kB
        David Alves
      9. 1123.patch
        245 kB
        David Alves
      10. 1123.patch
        117 kB
        David Alves
      11. 1123-3.patch.gz
        13 kB
        amorton

        Issue Links

          Activity

          Jonathan Ellis added a comment -

          committed. (turns out I didn't push the revert earlier, so I just left that out when I did push.)

          David Alves added a comment -

          Fixes the NPE. Problem was the result.getColumnCount call inside the logging statement (result may be null).

          Jonathan Ellis added a comment -

          Reverted pending tests fix.

          Jonathan Ellis added a comment -

          Hmm, looks like this broke our test log4j config somehow. ant test gives a lot of this:

              [junit] ERROR 14:02:56,567 Fatal exception in thread Thread[MigrationStage:1,5,main]
              [junit] java.lang.NullPointerException
              [junit] 	at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1195)
              [junit] 	at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1087)
              [junit] 	at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
              [junit] 	at org.apache.cassandra.config.ColumnDefinition.readSchema(ColumnDefinition.java:247)
              [junit] 	at org.apache.cassandra.config.CFMetaData.fromSchema(CFMetaData.java:1320)
              [junit] 	at org.apache.cassandra.config.KSMetaData.deserializeColumnFamilies(KSMetaData.java:293)
              [junit] 	at org.apache.cassandra.db.DefsTable.mergeColumnFamilies(DefsTable.java:342)
              [junit] 	at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:255)
              [junit] 	at org.apache.cassandra.service.MigrationManager$1.call(MigrationManager.java:202)
              [junit] 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
              [junit] 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
              [junit] 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              [junit] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              [junit] 	at java.lang.Thread.run(Thread.java:662)
          

          where CFS:1195 is a logger.debug call.

          ant test uses test/conf/log4j-server.properties, which just specifies a file and stdout at DEBUG.

          Jonathan Ellis added a comment -

          committed, thanks!

          David Alves added a comment -

          +1, wfm

          Jonathan Ellis added a comment -

          How about this? v9 is v7 with an ExecuteOnlyExecutor to remind us in the future not to use submit on the trace stage.
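          (For context, the shape of such an executor is simple: override the submit methods to fail fast so callers are forced onto execute(). The sketch below is illustrative only and assumes a plain ThreadPoolExecutor parent; it is not the committed ExecuteOnlyExecutor.)

              import java.util.concurrent.*;

              // Illustrative sketch only, not the committed ExecuteOnlyExecutor: an executor
              // whose submit() methods fail fast, so tasks can only go through execute() and
              // exceptions can't end up swallowed inside an unread Future.
              public class ExecuteOnlyExecutor extends ThreadPoolExecutor
              {
                  public ExecuteOnlyExecutor(int threads, BlockingQueue<Runnable> queue)
                  {
                      super(threads, threads, 0L, TimeUnit.MILLISECONDS, queue);
                  }

                  @Override
                  public Future<?> submit(Runnable task)
                  {
                      throw new UnsupportedOperationException("use execute() on the trace stage");
                  }

                  @Override
                  public <T> Future<T> submit(Runnable task, T result)
                  {
                      throw new UnsupportedOperationException("use execute() on the trace stage");
                  }

                  @Override
                  public <T> Future<T> submit(Callable<T> task)
                  {
                      throw new UnsupportedOperationException("use execute() on the trace stage");
                  }
              }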

          David Alves added a comment -

          Moves the error handling out of DebuggableTPE into a parent TPE that is also used as the bare TPE for tracing; DebuggableTPE now does only tracing plus the blocking queue.

          Jonathan Ellis added a comment -

          an exception from a FutureTask will not be seen unless get() is called

          Thus, the default exception handler doesn't matter. This is the main problem DTPE solves. (That, and block-until-room-on-queue, which is a better policy than the defaults.)

          I note that we always use .execute instead of .submit on the tracing stage for now, but I'd like to cover our bases, or this could be a real head-scratcher if/when we do exercise the submit path.
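          (For readers following the DTPE discussion: the pattern being referred to is the standard afterExecute trick from the ThreadPoolExecutor javadoc. A minimal sketch is below, assuming a logger would be wired in where the stderr line is; this is not the actual DebuggableThreadPoolExecutor code.)

              import java.util.concurrent.*;

              // Sketch of the afterExecute pattern (cf. the ThreadPoolExecutor javadoc):
              // exceptions from tasks that went through submit() are trapped inside the
              // FutureTask and only resurface on get(), so if nobody calls get() the
              // executor has to unwrap them itself.
              public class LoggingExecutor extends ThreadPoolExecutor
              {
                  public LoggingExecutor(int threads, BlockingQueue<Runnable> queue)
                  {
                      super(threads, threads, 0L, TimeUnit.MILLISECONDS, queue);
                  }

                  @Override
                  protected void afterExecute(Runnable r, Throwable t)
                  {
                      super.afterExecute(r, t);
                      if (t == null && r instanceof Future<?>)
                      {
                          try
                          {
                              ((Future<?>) r).get(); // task is already done; returns or throws immediately
                          }
                          catch (ExecutionException e)
                          {
                              t = e.getCause(); // the exception the task body actually threw
                          }
                          catch (InterruptedException e)
                          {
                              Thread.currentThread().interrupt();
                          }
                          catch (CancellationException ignored) {}
                      }
                      if (t != null)
                          System.err.println("Unexpected exception in task: " + t); // DTPE logs this instead
                  }
              }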

          David Alves added a comment - edited

          maybe I'm missing something, but I think we're covered:

          • since the responsibility of logging such an event would always fall on the writing thread, my point was that any log event generated by that thread would already go to all appenders but the tracing one, because the TL state doesn't get propagated there (for instance if we used try/catch in runMayThrow or if we logged in afterExecute).
          • CassandraDaemon already sets a default exception handler that logs to error.
          • The only thing that is not done is the test-whether-exception-handler-is-set-and-if-not-log dance that DTPE does, but since one such handler is set this shouldn't be a problem, correct?

          All this being said, nothing like making sure by testing. Will post back.

          Jonathan Ellis added a comment -

          after I had done this I realized that there was no point

          the point was that an exception from a FutureTask will not be seen unless get() is called, or we do some extra gymnastics a la DTPE. So what we want the tracing executor to do is to logExceptionsAfterExecute like DTPE, but we only want that to be logged to a file; we don't want to try to insert it as a tracing event.

          David Alves added a comment -

          (previous results were with regard to the writes/sec rate); int_op_rate/key_rate see the same results, and avg latency does not show a difference

          David Alves added a comment -

          ran some smoke tests (locally): 0.1% tracing probability under full load causes ~5.5% slowdown (in writes), without any tracing event being discarded.
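          (The tracing probability here is a per-request coin flip against the configured value, so the check itself is cheap; the overhead comes from the sessions that do get traced. A minimal sketch of the sampling decision, with hypothetical names rather than the patch's actual API:)

              import java.util.Random;

              // Illustrative sketch of per-request probabilistic sampling (names are
              // hypothetical, not the patch's API): each incoming request is traced
              // with probability traceProbability, e.g. 0.001 for 0.1%.
              public class TraceSampler
              {
                  private final Random random = new Random();
                  private volatile double traceProbability = 0.0;

                  public void setTraceProbability(double p)
                  {
                      traceProbability = p;
                  }

                  public boolean shouldTrace()
                  {
                      return traceProbability > 0 && random.nextDouble() < traceProbability;
                  }
              }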

          David Alves added a comment -

          Hmm, can we do something clever like http://stackoverflow.com/questions/4046228/log4j-excluding-the-logging-of-some-classes to just log o.a.c.tracing to stdout + R so we don't just error out silently when something goes wrong w/ tracing?

          after I had done this I realized that there was no point. All log events from everywhere (including o.a.c.tracing), except the ones coming from tracing stage executions, are already present in all logs (tracing, rolling and stdout) if at the debug level. Tracing writes do not inherit the DebuggableTPE, which means log events coming from a tracing write do not go to the tracing appender but do go to the rest of the appenders by default already.

          David Alves added a comment -
          • Fixed NPE in TracingAppender because of inlining isTracing() (the instance may not be built at the time the appender is started)
          • Added settraceprobability to nodetool
          • Removed tracing settings from stress (assumed that was the goal, happy to re-add if needed)
          David Alves added a comment -

          Hmm, can we do something clever like http://stackoverflow.com/questions/4046228/log4j-excluding-the-logging-of-some-classes to just log o.a.c.tracing to stdout + R so we don't just error out silently when something goes wrong w/ tracing?

          We can; I'm taking care of that in the next patch along with nodetool. Although, just to be clear, using a bare TPE is still the way to go because we still wouldn't want to propagate tracing to tracing mutation threads.

          nice job on the DTPE cleanup.

          Jonathan Ellis added a comment -

          v6 attached. adds trace session.finished_at, adds back debug logging for thrift methods when tracing is off, removes probability from thrift interface.

          uses newTaskFor in DTPE, but overrides execute as well. (otherwise we wrap, then newTaskFor wraps a second time. might as well just wrap once.)

          TODO: exposing probability to nodetool.
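          (Regarding the newTaskFor point above: submit() funnels through newTaskFor(), so wrapping there covers the submit path, while execute() never calls newTaskFor and has to be wrapped separately — which is why wrapping in both places double-wraps. A rough sketch of the hook follows; the ThreadLocal is a stand-in for the real per-thread trace state, not the committed DTPE code.)

              import java.util.concurrent.*;

              // Illustrative sketch of the newTaskFor hook under discussion. The
              // ThreadLocal "traceState" stands in for the patch's real per-thread
              // trace session state; capture it on the submitting thread and restore
              // it on the worker thread before the task body runs.
              public class TracingAwareExecutor extends ThreadPoolExecutor
              {
                  private static final ThreadLocal<Object> traceState = new ThreadLocal<Object>();

                  public TracingAwareExecutor(int threads, BlockingQueue<Runnable> queue)
                  {
                      super(threads, threads, 0L, TimeUnit.MILLISECONDS, queue);
                  }

                  @Override
                  protected <T> RunnableFuture<T> newTaskFor(final Callable<T> callable)
                  {
                      final Object submitterState = traceState.get(); // captured on the submitting thread
                      return super.newTaskFor(new Callable<T>()
                      {
                          public T call() throws Exception
                          {
                              traceState.set(submitterState); // restored on the worker thread
                              try
                              {
                                  return callable.call();
                              }
                              finally
                              {
                                  traceState.remove();
                              }
                          }
                      });
                  }
              }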

          Jonathan Ellis added a comment -

          the JMX one is a descendant of DebuggableTP and that one propagates the trace state which we wouldn't want when writing the traces themselves

          Hmm, can we do something clever like http://stackoverflow.com/questions/4046228/log4j-excluding-the-logging-of-some-classes to just log o.a.c.tracing to stdout + R so we don't just error out silently when something goes wrong w/ tracing?

          David Alves added a comment -

          final touches:

          • JMX enabled tracing
          • simplified enabling tracing
          • completed cli and stress
          David Alves added a comment -

          All in all I think this version works fine for distributed debugging of a single request, but that profiling sets of requests would benefit from the old schema/additional typed info.

          David Alves added a comment -

          I'm sorry this took a bit; it was not immediate getting it to run again.

          I know this is an important issue and I don't want to delay it any further, so whether or not further discussion will change stuff, here's a version that finishes the suggested implementation.

          Some thoughts:

          • +1 on the new way of declaring KS's and CF's
          • +1 on hooking the appender
          • +1 on leaving cli summary out (it was just something I made for my own purposes as I was considering the uses of the tracing system/making something readable).
          • +1 on reducing overall clutter

          Some issues:

          • in order for thrift/cql3 to work together, names still need to go through ColumnNameBuilder or weird stuff happens (e.g. an IAE on a bytebuffer). Also, eventId needs to be prefixed to the col names in events (and not stored in a col by itself).
          • the reason the StageManager provided a bare ThreadPool for the tracing stage instead of the JMX one was because the JMX one is a descendant of DebuggableTP, and that one propagates the trace state, which we wouldn't want when writing the traces themselves; what happens otherwise is an infinite loop. (Also, state wasn't being propagated because: a) setupContext() was never called, and b) TPE.execute() never calls newTaskFor, meaning overriding newTaskFor doesn't work for all tasks, only for those that call submit(), which are a small percentage.)
          • I still think that for profiling purposes having a single table would be better; if someone wants to get the sets of all events for "batch_mutate" sessions it will require doing two queries (and won't be feasible from cqlsh, for instance).
          • having a single "activity" field means that any bulk processing will have to do string parsing and that no index will help in picking certain types of events.
          • the typed payloads made it easy to perform specific gets to answer questions such as "what's the avg returned size for get_range_slices"; these questions are now much harder to answer because: a) they need multiple gets and b) the resulting string must be parsed (if it even includes the info)
          Jonathan Ellis added a comment -

          Dug into this pretty hard today. Thanks for the huge effort, David!

          The good news is I think we can simplify this a great deal. The bad news is that it's a lot of code churn. Sorry about that! We probably could have used a little more back-and-forth before fleshing it out so completely.

          What I saw as I tried to add trace points was that the TraceEvent/Builder api was fairly cumbersome. What I really wanted to do was instrument all the debug logging that we've accumulated, most of which has the battle scars of being added to solve a particular tricky situation.

          So I started banging on the API and realized that I was ending up with something that resembled a logging appender. So why not hook into our existing logging api? Like Aaron's original idea, but without any copy and pasting; we can just implement a log4j LoggingAppender and life is good.
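          (To make the appender idea concrete: a log4j 1.x appender only needs to extend AppenderSkeleton, so every existing debug line can be intercepted in one place. The sketch below is just the shape of the hook; the actual tracing calls are left as comments since they depend on the patch's Tracing API, and this is not the committed TracingAppender.)

              import org.apache.log4j.AppenderSkeleton;
              import org.apache.log4j.spi.LoggingEvent;

              // Shape of the hook only, not the committed TracingAppender: log4j hands every
              // LoggingEvent to append(), where a line logged on a thread that carries an
              // active trace session could be turned into a trace event.
              public class TracingAppender extends AppenderSkeleton
              {
                  @Override
                  protected void append(LoggingEvent event)
                  {
                      // Only record when the current thread has a trace session, and never
                      // re-enter from the tracing write path itself (this is the infinite-loop
                      // concern noted below). The calls are commented out because they depend
                      // on the patch's Tracing API:
                      // if (Tracing.isTracing())
                      //     Tracing.trace(event.getRenderedMessage());
                  }

                  @Override
                  public void close() {}

                  @Override
                  public boolean requiresLayout()
                  {
                      return false;
                  }
              }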

          This does mean we don't have payload maps anywhere but in the session initialization, which gets special-cased, but I'm fine with that.

          I've pushed a sketch of this approach to https://github.com/jbellis/cassandra/tree/1123-4. I can't devote another day to finish it, so I'm going to have to throw it back to David, but I've considerately left the parts that need work in a non-compiling state so it's clear where to start.

          (The main drawback is we have to be careful about where we log on the write path to avoid an infinite loop, but this was a problem with the old approach as well. At least infinite loops will be fairly obvious and easy to fix.)

          Some other points:

          • I'm violently opposed to serializing thrift objects into the trace. This just pushes the job of making them human-readable out to each consumer. Let's solve this once at the server level instead. One approach is given (but not completed) in my tree. Note that I'm okay with being barely human-readable; the main use case we're concerned with is tracing queries interactively, in which case we already know the parameters and the logging is a formality. I do want it to be possible to reconstruct a problematic query detected by probabilistic sampling, but it doesn't have to be easy. (That said, turning hex back into cli or cqlsh is at least possible without writing a Thrift deserializer first, so it has that much of an advantage over the initial patch.)
          • I'm fine with giving up some structure in exchange for ease of use. A single "activity" column, with some metadata, is enough.
          • I'm also fine with giving up any or all of the pretty printer, summary by request type, and tests, which I've made no effort to port. Would strongly prefer getting a bare bones implementation finished, then adding more functionality later. (I do note that the pretty printer probably makes more sense to grow vertically rather than horizontally.)
          • enabling tracing-by-probability should be in jmx rather than thrift. unsure if max-sessions-to-trace is useful.
          • I think you will like the improvements around DTPE and the Stage.
          • Wiring up appenders at different levels (we'd want the file appender at INFO, the tracing one at DEBUG) is kind of a bitch. Apologies in advance. Possibly useful: http://stackoverflow.com/questions/2154539/log4j-log-level-per-appender-for-a-single-logger, http://stackoverflow.com/questions/751431/how-can-i-direct-log4j-output-so-that-different-log-levels-go-to-different-appen

          Think that about covers it, but it's late and I could have missed something.

          Jonathan Ellis added a comment -

          Finally I'm having issues using the index from thrift. Is there an incompatibility (i.e., using an IndexExpression with a CQL3 table?).

          Yes, see https://issues.apache.org/jira/browse/CASSANDRA-4377?focusedCommentId=13436817&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13436817

          David Alves added a comment - edited

          Attached is a patch that implements almost everything mentioned:

          • Trace is implemented asynchronously with a new Stage that has a threadpool with a single thread that refuses to execute when the queue is full (as suggested, a WARN is logged; experiments say that under huge load a 0.001 traceProbability still works without rejecting events). (Also, this threadpool is the only one that does not propagate trace events.)
          • Allows enabling/disabling tracing from the cli. Also, enable tracing has two parameters: traceProbability (the probability that any single request from that client gets traced) and maxTraceNumber, to allow setting a maximum number of traces to do (-1 sets this to Integer.MAX_VALUE, which is also the default)
          • Adds the possibility to enable tracing in stress (using -tr probability [optionally maxNumTraces])
          • TraceEvents can be built using a fluent builder (TraceEventBuilder) that is also able to deserialize events (both from thrift and from IColumns).
          • All requests in CassandraServer start a tracing session when tracing is enabled and all parameters are stored along with the request details. This is done using a ThriftType that is able to serialize and deserialize thrift objects into Cassandra.
          • User Request/Reply, Stage Start/Finish and Message Request/Reply are traced along with specific custom requests (such as apply_mutation and get_column_family)
          • The cli contains two other commands to explore traces:

          show tracing summary [request_name] - this displays a summary for a request type:

          Summary for sessions of request: batch_mutate
          Total Sessions: 500
          Total Events: 3190
                                 ==============================================================================
                                 |    Avg.    |   StdDev.  |   Max.     |    Min.    |     99%   |     Unit   |
          =====================================================================================================
          | Latency              |      22.48 |      28.68 |     293.90 |       0.56 |     150.38|       msec |
          |---------------------------------------------------------------------------------------------------|
          | Batch Rows           |       1.00 |       0.00 |       1.00 |       1.00 |       1.00| amount/req |
          |---------------------------------------------------------------------------------------------------|
          | Mutations            |       5.00 |       0.00 |       5.00 |       5.00 |       5.00| amount/req |
          |---------------------------------------------------------------------------------------------------|
          | Deletions            |       0.00 |       0.00 |       0.00 |       0.00 |       0.00| amount/req |
          |---------------------------------------------------------------------------------------------------|
          | Columns              |       5.00 |       0.00 |       5.00 |       5.00 |       5.00| amount/req |
          |---------------------------------------------------------------------------------------------------|
          | Counters             |       0.00 |       0.00 |       0.00 |       0.00 |       0.00| amount/req |
          |---------------------------------------------------------------------------------------------------|
          | Super Col.           |       0.00 |       0.00 |       0.00 |       0.00 |       0.00| amount/req |
          |---------------------------------------------------------------------------------------------------|
          | Written Bytes        |     170.00 |       0.00 |     170.00 |     170.00 |     170.00| amount/req |
          |---------------------------------------------------------------------------------------------------|
          
          Quickest Request sessionId: 135b0c60-eac0-11e1-0000-fe8ebeead9ff
          Slowest  Request sessionId: f09c3e10-eabf-11e1-0000-fe8ebeead9ff
          

          explain trace session [sessionId] - displays the complete set of events of a tracing session along with a to-scale timeline for its execution.

          Session Summary: f09c3e10-eabf-11e1-0000-fe8ebeead9ff
          Request: batch_mutate
          Coordinator: /127.0.0.1
          Interacting nodes: 2 {[/127.0.0.1, /127.0.0.2]}
          Duration: 293901000 nano seconds (293.901 msecs)
          Consistency Level: ONE
          Request Timeline:
          
          ||--------------------------------------------------------------------------------------------------|
          Se;/Msg                                                                                             /Se
          
                                                                                                        |-||   
                                                                                                        Msg,apply_mutation;St:[Mutation];/St:[Mutation]   
          
          
          Caption: 
          Se              - Session start (start of request: batch_mutate)
          /Se             - Session end (end of request: batch_mutate)
          St[StageName]   - Begin of Stage with StageName (e.g. Mutation)
          /St[StageName]  - End of Stage with StageName
          Msg             - Arrival of message from coordinator
          /Msg            - Reply to message from coordinator
          ???             - Custom traces 
          
          NOTE: timelines are drawn to scale but precision is limited to 100 positions.
           When multiple events occur in the same position (due to lack of precision) they are separated by ','
           When two positions are too close together trace names are separated by ';'
          
          Full event list (in order of occurrence)
          
          Id: f09c3e11-eabf-11e1-0000-fe8ebeead9ff Name: batch_mutate           Type: SESSION_START          Source: /127.0.0.1             Timestamp (millis): 1345464589425          Duration (nanos): 73000                 
          Id: f09c8c30-eabf-11e1-0000-fe8ebeead9ff Name: MessageDeparture[47681] Type: MESSAGE_DEPARTURE      Source: /127.0.0.1             Timestamp (millis): 1345464589427          Duration (nanos): 2003000               
          Id: f0c63440-eabf-11e1-0000-7fd66bb03abf Name: MessageArrival[47681]  Type: MESSAGE_ARRIVAL        Source: /127.0.0.2             Timestamp (millis): 1345464589700          Duration (nanos): 47000                 
          Id: f0c745b0-eabf-11e1-0000-7fd66bb03abf Name: MutationStage          Type: STAGE_START            Source: /127.0.0.2             Timestamp (millis): 1345464589707          Duration (nanos): 7045000               
          Id: f0c745b1-eabf-11e1-0000-7fd66bb03abf Name: apply_mutation         Type: MISC                   Source: /127.0.0.2             Timestamp (millis): 1345464589707          Duration (nanos): 25000                 
          Id: f0c793d0-eabf-11e1-0000-7fd66bb03abf Name: MutationStage          Type: STAGE_FINISH           Source: /127.0.0.2             Timestamp (millis): 1345464589709          Duration (nanos): 9617000               
          Id: f0c91a70-eabf-11e1-0000-fe8ebeead9ff Name: SESSION_END            Type: SESSION_END            Source: /127.0.0.1             Timestamp (millis): 1345464589719          Duration (nanos): 293901000          
          

          A timeline from a fully distributed deployment (this was from ccm) would be much more balanced (the lower line would be bigger).

          NOTES:

          • Some parameters are hard-coded (such as the tracing queue, the ttl for trace events and the number of trace events to fetch for the summaries). It might be better to configure these somewhere but I've refrained from adding more config variables without consensus.
          • There should be more traces.
          • I implemented a distributed system keyspace to be used elsewhere but in the end reverted to using the trace keyspace; the reasoning was that a distributed system table might want different things than a tracing system (replication factor, for instance). Since tracing has its own keyspace I also kept the CFMetadata there (I'm glad to move it if it makes sense even without the dist systable, or if the consensus is that trace should be there)
          • In order to display the summarizing cli command, the commons-math jar is required (for statistics)
          • I ended up changing the schema to:
          CREATE KEYSPACE trace WITH strategy_class = 'SimpleStrategy'
            AND strategy_options:replication_factor = '1';
          
          CREATE TABLE trace.trace_events(sessionId  timeuuid,
            coordinator       inet,
            eventId          timeuuid,
            description       text,
            duration          bigint,
            happened_at          timestamp,
            name              text,
            payload_types     map<text, text>,
            payload           map<text, blob>,
            source            inet,
            type              text,
            PRIMARY KEY (sessionId, coordinator, eventId));
          
          CREATE INDEX idx_name ON trace.trace_events (name);
          

          Note the missing Sessions table (all the info there was redundant) and the added fields for payload.

          Finally I'm having issues using the index from thrift. Is there an incompatibility (i.e., using an IndexExpression with a CQL3 table?).

          • Tests pass
          Jonathan Ellis added a comment -

          we don't need to store results, just the query parameters (bind variables, in the CQL case)

          David Alves added a comment -

          What should we do wrt reads then? Also store the complete results (what is returned to the client) and the whole row (required to cover the examples you mentioned), or do we summarize in that case? In this case not summarizing can be costly (lots of net traffic).

          Jonathan Ellis added a comment -

          I think we could get away with storing the number and size of both argument and results in the "coordinator" node and store the actual size of the whole row (num and size of cols) in the nodes that are performing the reads.

          That would be adequate in this instance, but if we log the arguments we arm ourselves against the scenarios we didn't think of ahead of time, as well as the ones we did.

          I think it's straightforward what we should do here. Implement parameter logging and if sometime down the road someone comes along and says "I really want to trace my application but I can't afford to log parameters" we (or he) can add some way to make it optional. Until then YAGNI applies.

          David Alves added a comment -

          Right, but for that example storing the args as they are wouldn't help either, which was the issue to begin with. Yet of course your point is valid. In order to accommodate the use case you mention, I think we could get away with storing the number and size of both arguments and results in the "coordinator" node and storing the actual size of the whole row (num and size of cols) in the nodes that are performing the reads. wdyt?

          Jonathan Ellis added a comment -

          Not for the first example I gave.

          David Alves added a comment -

          Since every get_slice call will have the same arg count I don't see this being very useful.

          I actually meant storing both the size and length of the arguments and the results. Would that be sufficient?

          Jonathan Ellis added a comment -

          this would mean that we're storing data belonging to a keyspace (that might have access control) in another keyspace

          True, but it's still opt-in. Don't trace if that's a bad fit for your use case. For the 99.9% of deployments where a single organization both operates and uses the cluster, it's very useful to know exactly what statement was slow, not just what rpc method it was. Say get_slice is slow: was it a 1K row being accessed, or a 100MB row? Were we reading 10 columns or 10000? Huge difference in how to address those scenarios.

          maybe we could instead store the number and length of the arguments

          Since every get_slice call will have the same arg count I don't see this being very useful.

          David Alves added a comment - edited

          I get the point; the original paper mentioned that that was the case because the data was stored outside of the cluster.

          Still, there is the question of authentication: even though access control is not very thorough AFAIK, the principle behind it is per-keyspace access, correct? This would mean that we're storing data belonging to a keyspace (that might have access control) in another keyspace (that must be accessible from outside).

          I'm happy to do it either way, but maybe we could instead store the number and length of the arguments. Additionally, this would decrease tracing network bandwidth usage.

          Jonathan Ellis added a comment -

          In one of the papers this same issue raised privacy concerns

          there must be some other context there, because that doesn't make sense here: these are parameters to requests operating on the database that we're proposing to store them in. In other words: given that Cassandra already has your data, storing it a second time is not a privacy concern.

          Think of it as a subset of auditing, if you prefer.

          David Alves added a comment -

          Thank you for the thorough review.

          Inclined to think we should include the parameters along w/ the String request type on session start. (Object... + toString would be adequate.) Maybe even use the new List type to store the arguments (CASSANDRA-3647).

          In one of the papers this same issue raised privacy concerns that are probably even more valid for Cassandra since it's an open source project. IMO we should at the very least make this optional, if not drop it altogether.

          Tracing should be asynchronous. StorageProxy.mutate waits for a response, this is not what we want. Suggest a simple ExecutorService + queue. (If queue gets full, throw out the tracing events and log a WARN.).

          Would like tracing to log.debug the event as well. This will cut down on duplicate debug/trace code, but also give us a fallback if we can't log it remotely. This will also cut down on log spam for when we enable debug level globally – only logging requests at debug where tracing was explicitly enabled will be a huge improvement.

          +1, nice idea, was looking into doing something similar.

          get_slice uses startSession instead of startSessionIfRequested.

          There's a no-op initialization of trace context in StorageService.

          Naming: system_enable_query_details -> system_enable_query__tracing. TraceSession, TraceSessionState -> TraceSessionContext, TraceSessionContextThreadLocalState. endSession -> stopSession. getQueryDetails -> isTracingEnabled.

          Most already corrected in current version.

          CFMetaData definitions should be with the other hardcoded ones in CFMetaData.

          Let's move helpers that are only used by test code like EVENT_TYPE into the Test class.

          will do!

          Still think threadlocals are not the way to go, and this will become more clear as you try to add useful trace entries. I think you'll end up w/ a trace session "registry" like we have for MessagingService that we'll look up by session id. In that vein, I'm not sure what the afterExecute business is supposed to be doing. That stuff runs on the executor's thread, not the submitter's.

          Current version uses this to trace pre and post stage execution (which are the only trace events at the moment). I've just finished updating the cli and I'd like a chance to do some runs as is to get a sense of the usefulness of the current setup.

          Finally, a more generic keyspace name like dsystem would be nice for all distributed system tables. (We're thinking of using one for CASSANDRA-3706, for instance.)

          will look into/synchronize.
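
          For illustration, the trace session "registry" looked up by session id that the review suggests might look roughly like the following; this is only a sketch, and the class and method names here are illustrative rather than what the patch actually uses:

          import java.util.UUID;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.ConcurrentMap;

          // Rough sketch: a registry of in-flight trace sessions, looked up by session id,
          // instead of carrying the session in a thread-local.
          public final class TraceSessionRegistry
          {
              // Placeholder for whatever per-session state ends up being tracked.
              public static final class TraceSessionContext
              {
                  public final UUID sessionId;
                  public final String request;

                  public TraceSessionContext(UUID sessionId, String request)
                  {
                      this.sessionId = sessionId;
                      this.request = request;
                  }
              }

              private static final ConcurrentMap<UUID, TraceSessionContext> sessions =
                      new ConcurrentHashMap<UUID, TraceSessionContext>();

              public static TraceSessionContext startSession(String request)
              {
                  UUID id = UUID.randomUUID();
                  TraceSessionContext context = new TraceSessionContext(id, request);
                  sessions.put(id, context);
                  return context;
              }

              public static TraceSessionContext get(UUID sessionId)
              {
                  return sessions.get(sessionId);
              }

              public static void stopSession(UUID sessionId)
              {
                  sessions.remove(sessionId);
              }
          }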

          Jonathan Ellis added a comment -

          Nit: rest of the code base places static imports after non-static.

          Inclined to think we should include the parameters along w/ the String request type on session start. (Object... + toString would be adequate.) Maybe even use the new List type to store the arguments (CASSANDRA-3647).

          get_slice uses startSession instead of startSessionIfRequested.

          A session named execute_cql_query is not very useful; should use queryString instead. May want to just push the CQL tracing into (cql3) QueryProcessor. This will mean less code to duplicate in the native CQL protocol handler.

          Tracing should be asynchronous. StorageProxy.mutate waits for a response, which is not what we want. Suggest a simple ExecutorService + queue. (If queue gets full, throw out the tracing events and log a WARN.)

          Would like tracing to log.debug the event as well. This will cut down on duplicate debug/trace code, but also give us a fallback if we can't log it remotely. This will also cut down on log spam for when we enable debug level globally – only logging requests at debug where tracing was explicitly enabled will be a huge improvement.

          CFMetaData definitions should be with the other hardcoded ones in CFMetaData.

          Let's move helpers that are only used by test code like EVENT_TYPE into the Test class.

          There's a no-op initialization of trace context in StorageService.

          Still think threadlocals are not the way to go, and this will become more clear as you try to add useful trace entries. I think you'll end up w/ a trace session "registry" like we have for MessagingService that we'll look up by session id. In that vein, I'm not sure what the afterExecute business is supposed to be doing. That stuff runs on the executor's thread, not the submitter's.

          Naming: system_enable_query_details -> system_enable_query__tracing. TraceSession, TraceSessionState -> TraceSessionContext, TraceSessionContextThreadLocalState. endSession -> stopSession. getQueryDetails -> isTracingEnabled.

          Finally, a more generic keyspace name like dsystem would be nice for all distributed system tables. (We're thinking of using one for CASSANDRA-3706, for instance.)
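
          As a rough sketch of the "simple ExecutorService + queue" suggestion above (TraceEvent and writeEventToTraceCF are illustrative placeholders, not part of the patch):

          import java.util.concurrent.ArrayBlockingQueue;
          import java.util.concurrent.RejectedExecutionException;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;

          // Rough sketch: tracing writes go through a bounded, single-threaded executor;
          // if the queue fills up the event is dropped and a WARN is logged, so tracing
          // never blocks the read/write path.
          public final class TraceEventAppender
          {
              private static final Logger logger = LoggerFactory.getLogger(TraceEventAppender.class);

              private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
                      1, 1, 0L, TimeUnit.MILLISECONDS,
                      new ArrayBlockingQueue<Runnable>(1000),
                      new ThreadPoolExecutor.AbortPolicy());

              public void append(final TraceEvent event)
              {
                  try
                  {
                      executor.execute(new Runnable()
                      {
                          public void run()
                          {
                              writeEventToTraceCF(event); // fire-and-forget mutation, no waiting for a response
                          }
                      });
                  }
                  catch (RejectedExecutionException e)
                  {
                      logger.warn("Tracing queue full; dropping trace event " + event);
                  }
              }

              private void writeEventToTraceCF(TraceEvent event)
              {
                  // placeholder for the actual insert into the trace columnfamily
              }

              public static final class TraceEvent
              {
                  // placeholder for session id, source, event name, duration, etc.
              }
          }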

          David Alves added a comment - - edited

          First patch that implements most of what was suggested, with accompanying test.

          Still to define/discuss:

          • Aaron's original version "replaced" the logger, which meant there were already a lot of things being traced. As is, this patch is pretty "bare" in terms of trace events, in the sense that stuff that goes through CassandraServer gets a new trace session but not much else is traced (even though sessions get propagated across threads and nodes). Maybe we could add events automatically when a session is propagated across stages and across nodes, thereby getting a per-stage and per-node view of how things are executed (I think this could be done rather unintrusively)?
          • I haven't added the log to the logger option yet, as I'm not completely sure how useful it is if events are generated per stage/node.
          Jonathan Ellis added a comment -

          Sure.

          David Alves added a comment -

          Was about to start tracing with Guava's Stopwatch, but it's only available from Guava 10.0. Upgrade?

          Jonathan Ellis added a comment -

          Relevant: http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/base/Stopwatch.html
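
          For illustration, timing a traced step with Guava's Stopwatch (10.x API; later Guava versions use Stopwatch.createStarted() and elapsed()) would look roughly like this; doTracedWork is just a stand-in for the operation being measured:

          import java.util.concurrent.TimeUnit;
          import com.google.common.base.Stopwatch;

          public class StopwatchExample
          {
              public static void main(String[] args) throws Exception
              {
                  Stopwatch watch = new Stopwatch().start();  // Guava 10.x: construct and start
                  doTracedWork();                             // stand-in for the traced operation
                  long durationNanos = watch.stop().elapsedTime(TimeUnit.NANOSECONDS);
                  System.out.println("traced step took " + durationNanos + " ns");
              }

              private static void doTracedWork() throws InterruptedException
              {
                  Thread.sleep(10); // placeholder work
              }
          }
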
          Jonathan Ellis added a comment -

          So, super high level view of how I think this should work:

          Store events like this:

          CREATE TABLE trace_sessions (
            session int,
            coordinator inetaddr,
            request text,
            PRIMARY KEY (session, coordinator)
          );
          
          CREATE TABLE trace_events (
            session     int,
            coordinator inetaddr,
            id     uuid,
            source inetaddr,
            event  text,
            happened_at timestamp,
            duration int,
            PRIMARY KEY (session, coordinator, id)
          );
          

          The session table will log the operation being processed, and the events will log its progress.

          Coordinator and replicas will all just throw events into this table; they don't have to keep anything in memory. Then we get the permanent record, AND clients can query the log directly without needing to add more special-case code to the coordinator.

          One way to tell the client about session ids would be to change Aaron's API a bit: int system_trace_next_query(). Then clients can just "SELECT * from trace_events WHERE coordinator = ? and session = ?" if desired.

          Alternatively we could retain the "enable tracing on all queries until I turn it off" and add int system_get_last_query_session() but I prefer the one-method version for simplicity and because tracing probably shouldn't be left on indefinitely.

          No idea how best to expose to a Thrift-less CQL. Should we create a special-cased function there too?

          Notes:

          • coordinator will generate the session id locally. Session ids will NOT be globally unique; the (session, coordinator) pair identifies a series of events
          • session comes before coordinator since we want the session id to be the partition key; this will create narrower rows
          • "source" is the machine logging the event, not necessarily the same as coordinator (better name needed?)
          • happened_at will come from currentTimeMillis, so it will not be high-resolution enough to give us the duration of many events; instead, we should use nanotime on the coordinator (and replicas) to compute duration
          • We should TTL event data (1 day?) both to conserve space and to avoid problems when event ids wrap
          • Tracing N% of queries randomly and providing an external session id can be broken out into separate tickets
          • Not sure if the threadlocal approach I originally advocated is a good one. Certainly there are important aspects that are not covered by the "main" coordinator and replica threads: commitlog on the replica, and response processing on the coordinator; we may add more such thread pools in the future.
          • Aaron's hacking of the slf4j adapator is clever, but I think we should leave it out; you'd still have to piece together what happened across multiple machines. Logging into a columnfamily instead solves that problem and makes the data more accessible than log files.
          • Probably worth encapsulating "log an event to the columnfamily if tracing is enabled, otherwise log at debug to log4j" into a utility method, though.
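
          A minimal sketch of the utility method described in the last point, with isTracingEnabled and logTraceEvent standing in for whatever the patch ends up providing:

          import org.slf4j.Logger;

          // Rough sketch only: log an event to the trace columnfamily when a trace
          // session is active for this request, otherwise fall back to a debug log line.
          public final class TraceOrDebug
          {
              public static void trace(Logger logger, String event)
              {
                  if (isTracingEnabled())        // placeholder: e.g. a session registered for this request
                      logTraceEvent(event);      // placeholder: async insert into the trace columnfamily
                  else if (logger.isDebugEnabled())
                      logger.debug(event);
              }

              private static boolean isTracingEnabled()
              {
                  return false; // placeholder
              }

              private static void logTraceEvent(String event)
              {
                  // placeholder
              }
          }
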
          Jonathan Ellis added a comment -

          That is a good idea, although I'd still like the ability to store results as well (which would let us do always-on sampling of the workload for instance)

          Sylvain Lebresne added a comment -

          While we wouldn't be able to do this for the thrift API (for compatibility reasons), one option would be that when tracing is enabled, we return a new structure with the tracing details as part of the response (as an optional field of CqlResult).

          Jonathan Ellis added a comment -

          Aaron, do you think you'll be getting back to this? If not, I'm happy to find someone to work on it.

          Jonathan Ellis added a comment -

          Another useful event would be username, if authenticated. Since authentication happens once per connection, we'd want a separate CF for per-connection information?

          Jonathan Ellis added a comment -

          The CassandraServer class logs timeouts at DEBUG level (see CASSANDRA-3661). But I agree this would be a useful "event" to record.

          Shotaro Kamio added a comment -

          Thrift request timeouts would be interesting as well, because a user gets a timeout exception but the server doesn't have any log about it.
          In our production cluster, we added logging where the timed-out exception is thrown. It logs a warning message with the query itself for later investigation.
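
          For illustration, that kind of logging might look roughly like the following (a sketch, not the actual production change; the wrapper method and queryDescription are placeholders):

          import java.util.concurrent.Callable;
          import org.apache.cassandra.thrift.TimedOutException;
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;

          // Rough sketch: log a WARN with the query itself when a timeout is about to be
          // thrown back to the client, so the server keeps a record of the timed-out request.
          public final class TimeoutLogging
          {
              private static final Logger logger = LoggerFactory.getLogger(TimeoutLogging.class);

              public static <T> T executeWithTimeoutLogging(String queryDescription, Callable<T> operation) throws Exception
              {
                  try
                  {
                      return operation.call();
                  }
                  catch (TimedOutException e)
                  {
                      logger.warn("Request timed out: " + queryDescription); // keep the query for later investigation
                      throw e;
                  }
              }
          }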

          Jonathan Ellis added a comment -

          Definitely.

          I think we can do two things with this:

          1. Answer "what is taking so long with query X"
          2. Automatically enable tracing for say 1% of all queries and look for patterns

          (2) is important because often (1) is not repeatable; if the slowness comes from hitting disk, for instance, it will usually be cached the next time.

          I think we're 90% of the way there with this, but log files aren't very programmatically accessible, i.e., it's difficult to implement scenario (2). I'd like to

          • make enable_logging method return the query context id
          • store trace results in a columnfamily keyed by the id so they can be retrieved and displayed (e.g. by adding EXPLAIN command to cqlsh, which would be a separate ticket)

          What do you think?

          amorton added a comment -

          Is there still interest in this?

          Jonathan Ellis added a comment -

          So I don't forget: CASSANDRA-3515 points out that it would be nice to track time spent scanning past tombstones. Or, more directly, the total number of columns we had to scan to collect the requested live ones.

          amorton added a comment -

          The attached patch includes the following changes:

          1) added system_enable_query_details(bool) to the thrift api

          2) added "enable query details" and "disable query details" to the CLI

          3) added queryDetails state to the ClientState to store the setting per connection

          4) o.a.c.service.QueryContext as a container for the per-query state. State is only created if logging details are turned on for the query, and is stored in thread-local storage and accessed through static methods.

          5) CassandraServer was modified to call QueryContext.startQuery() and stopQuery() to set up and clear the thread-local query context. This is done at the entry point for each thrift method that modifies / reads data.

          6) o.a.c.concurrent.DebuggableThreadPoolExecutor was modified to copy the QueryContext if present and set it in the worker thread. The changes to the submit() and execute() methods mimic the implementation of java.util.concurrent.ThreadPoolExecutor. This allows the context to flow from the connection thread to the read / mutate thread pools as well as any other cross-thread call a query makes (a simplified sketch of this pattern follows the comment).

          7) o.a.c.net.Message and MessageDeliveryTask were modified to include the QueryContext in cross node messages. The context is only sent if present AND if the query started on the local node (this stops the context flowing back to the origin). Serialising the QueryContext leverages the fact that the Message will already have the IP address in it.

          8) NOTE: The query context will only survive one hop through the cluster.

          9) the slf4j-log4j12 jar has been removed and the code copied into the main tree. When slf4j initialises it looks for an implementation of org.slf4j.impl.StaticLoggerBinder; this will now be the implementation in the source tree rather than the jar in lib.

          10) I modified the org.slf4j.impl.Log4jLoggerAdapter implementation; the Log4jLoggerFactory will wrap all log4j loggers in this adapter. The adapter now checks the thread-local QueryContext and elevates all TRACE and DEBUG messages to INFO if query details have been enabled, and prepends each message with either the "query_id@origin_ip" or "query_id@origin_ip message id" prefix.

          11) Added a line to ColumnFamilyStore to log how many SSTables were read during the query.

          Here's an example of the log output for a get() call starting at node2 and calling node1; the prefixes are "query 1@/127.0.0.2 - " and "query 1@/127.0.0.2 message 126 - ".

          INFO [pool-1-thread-1] 2011-01-13 01:39:03,423 CassandraServer.java (line 329) query 1@/127.0.0.2 - get
          INFO [pool-1-thread-1] 2011-01-13 01:39:03,458 StorageService.java (line 1391) query 1@/127.0.0.2 - Sorted endpoints are /127.0.0.1
          INFO [pool-1-thread-1] 2011-01-13 01:39:03,460 StorageProxy.java (line 377) query 1@/127.0.0.2 - weakread reading SliceByNamesReadCommand(table='Keyspace1', key=666f6f35, columnParent='QueryPath(columnFamilyName='Standard2', superColumnName='null', columnName='null')', columns=[bar,]) from 126@/127.0.0.1
          DEBUG [RequestResponseStage:1] 2011-01-13 01:39:03,489 ResponseVerbHandler.java (line 59) Processing response on an async result from 126@/127.0.0.1
          INFO [pool-1-thread-1] 2011-01-13 01:39:03,499 QueryContext.java (line 71) query 1@/127.0.0.2 - returning to client, async processing may continue


          INFO [ReadStage:3] 2011-01-13 01:39:03,481 ColumnFamilyStore.java (line 1317) query 1@/127.0.0.2 message 126 - SSTables read 1
          INFO [ReadStage:3] 2011-01-13 01:39:03,484 ReadVerbHandler.java (line 90) query 1@/127.0.0.2 message 126 - Read key 666f6f35; sending response to 126@/127.0.0.2

          NOTE: The query details did not appear in the message on the RequestResponseStage thread because the context was not sent in the return message from node1.

          This is my first major ticket, so please let me know of any issues. I chose to replace the slf4j adapter to remove the need to chase down all the log calls and check the QueryContext.
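
          The context propagation in point 6 boils down to the following pattern (a simplified sketch, not the patch itself; QueryContext here is reduced to a bare thread-local holder):

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.Future;

          // Simplified sketch of copying a per-query thread-local context from the
          // submitting thread into the worker thread that actually runs the task.
          public final class ContextPropagation
          {
              public static final class QueryContext
              {
                  private static final ThreadLocal<QueryContext> current = new ThreadLocal<QueryContext>();

                  public static QueryContext get()              { return current.get(); }
                  public static void set(QueryContext context)  { current.set(context); }
                  public static void clear()                    { current.remove(); }
              }

              private final ExecutorService delegate = Executors.newFixedThreadPool(4);

              public Future<?> submit(final Runnable task)
              {
                  final QueryContext context = QueryContext.get();   // captured on the connection thread
                  return delegate.submit(new Runnable()
                  {
                      public void run()
                      {
                          QueryContext.set(context);                  // installed on the worker thread
                          try
                          {
                              task.run();
                          }
                          finally
                          {
                              QueryContext.clear();                   // don't leak into the next task on this thread
                          }
                      }
                  });
              }
          }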

          Jonathan Ellis added a comment -

          Still looks like the simplest solution is to add a thrift system_enable_query_details() method that would cause what we log at debug to be logged at INFO for queries from that connection. We'd need to add this to the intra-cluster messages too, either as a part of the Message Header or (more lightweight) as a bit in the header computed by MessagingService.packIt. (Up until it hits MessagingService everything will be run on the same thread, so we can use a threadlocal like we do for keyspace in CassandraServer.)

          Incidentally, I don't think we log anything about how many sstables are checked on a per-query basis (although we collect this globally in a histogram), so that would be good to add.

          Jonathan Ellis added a comment -

          CASSANDRA-1355 added an RPCLogger to capture the timing at the CassandraServer / StorageProxy layer, but did not follow requests across machines

          Jonathan Ellis added a comment -

          CASSANDRA-1305 explored storing times for different query steps in an in-memory structure for later retrieval (by jmx?)

          Jonathan Ellis added a comment -

          possibly, but see CASSANDRA-1430 for why this may not be a good idea

          Matthew F. Dennis added a comment -

          I'd really like to see the results of this, as well as any other tracing/metrics, stored in a CF


            People

            • Assignee:
              David Alves
              Reporter:
              Jonathan Ellis
              Reviewer:
              Jonathan Ellis
            • Votes:
              1
              Watchers:
              15
