Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core, Tools
    • Labels: None

      Description

      "Write sample mode" gets us part way to testing new versions against a real world workload, but we need an easy way to test the query side as well.

      1. 6572-trunk.diff (29 kB) - Lyuben Todorov

        Activity

        Jonathan Ellis created issue -
        Jonathan Ellis added a comment -

        IMO the two options for where to serialize incoming requests are

        1. QueryProcessor
        2. StorageProxy

        The good thing about StorageProxy is that it's the common substrate for Thrift and CQL. The bad thing is that QP itself is complex enough to have regressions. So I'd lean towards ignoring Thrift and making this a CQL-only tool.

        Jonathan Ellis added a comment -

        As for the UI, I think a JMX method to enable recording every Nth query is enough to start with, and then a separate tool to run those queries against another cluster.

        Suggest saving a timestamp as well as the query itself so that we can play back with the same timing as the original, as well as "as fast as you can run them."
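
        For illustration, a minimal sketch of the kind of JMX surface this suggests (the interface and method names are hypothetical, not a committed API):

        public interface QueryRecordingMBean
        {
            // Record one out of every `frequency` queries, together with its
            // execution timestamp, so replay can reproduce the original timing.
            void enableQueryRecording(int frequency);

            void disableQueryRecording();
        }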

        Tagging this for 2.0 but we can revise that if it turns out to be more invasive than I think.

        Jonathan Ellis made changes -
        Fix Version/s: (none) → 2.0.5 [ 12325761 ]
        Marcus Eriksson added a comment -

        Jimmy Mårdell, something for you perhaps?

        Jimmy Mårdell added a comment -

        Sounds interesting, but it's not something I will have time to work on in the near future, at least.

        Jonathan Ellis made changes -
        Assignee: Lyuben Todorov [ lyubent ]
        Sylvain Lebresne made changes -
        Fix Version/s: 2.0.5 [ 12325761 ] → 2.0.6 [ 12326170 ]
        Lyuben Todorov added a comment -

        I think it would be easier for users to understand what's going on if we record the CQL query string in QP#processStatement and pass it to a function in SP (so the majority of the work can be done in SP while still allowing us to capture the CQL string, which is easy to understand), and then save that to a system table (haven't thought about the name yet) along with the timestamp of execution. This will give us a good starting point.

        Jonathan Ellis added a comment -

        Why split the work across two classes instead of doing it all in QP?

        Lyuben Todorov added a comment -

        I thought it would make more sense to have this kind of functionality in StorageProxy, but it makes sense to keep it simple by coding only in QP, and it will be better if someone else wishes to extend the workload recording functionality.

        Jonathan Ellis added a comment -

        Do you have any strong feelings here, Aleksey Yeschenko, on the QP/SP divide?

        Jeremiah Jordan added a comment -

        I would think saving this to append-only files is a better option than a system table. Also, we probably want to record some kind of timing with this for replay.

        Jonathan Ellis added a comment -

        Agreed on both counts.

        Aleksey Yeschenko added a comment -

        Do you have any strong feelings here, Aleksey Yeschenko, on the QP/SP divide?

        This belongs to QP.

        Lyuben Todorov added a comment -

        Just a thought: do we want to ignore queries on system/trace tables? Such queries might make users wonder why they did 30 inserts (assuming they record every query) but their log shows 34 queries. It will be fairly clear when they look at the log, but it's worth considering.

        Jonathan Ellis added a comment -

        Yes, we should ignore those (because replaying the log at the same trace settings should result in the same load).

        Lyuben Todorov added a comment -

        Just an update with progress so far, branch here. Currently query recording is enabled using JMX via StorageService#enableQueryRecording. This will record every nth query and append it to the QueryLog. Currently the log is piggybacking on the commit_log directory, but the long-term plan is to either add a setting to cassandra.yaml or modify enableQueryRecording to take another param that will be the directory for the log.

        Queries are stored in an append-only log as suggested. Once the log reaches 4MB (this should also be configurable; I'm thinking set a default and add a JMX function that can override said default), the log is renamed (a timestamp is added to the name) and a new log is created that stores the new appends.

        As for the replaying, it's still fairly basic: currently the entire query log file is read in a single operation (smarter approaches than storing a large collection of query strings are welcome), then the collection of queries is replayed sequentially. If there is too large a gap between queries, a timeout takes effect to avoid stalling the replay (again, this should be configurable; currently the timeout is 10s). Replaying of the workload is invoked via JMX right now, but a workload replayer tool will be in charge of handling the replaying, and also which cluster the logs get replayed to.

        Sylvain Lebresne made changes -
        Fix Version/s: 2.0.6 [ 12326170 ] → 2.0.7 [ 12326472 ]
        Lyuben Todorov added a comment -

        Is there a better alternative to using the Cassandra.Client to execute the queries once replay starts? I thought I could use the Java driver, but the driver isn't picked up by the ant build / ant stress-build tasks (it doesn't see the dependency although cassandra-driver-core-2.0.0.jar is in C*/tools/lib). AFAIK we didn't add the driver into lib/ because it would increase the size of the binary distribution too much.

        Sylvain Lebresne made changes -
        Fix Version/s: 2.0.7 [ 12326472 ] → 2.0.8 [ 12326712 ]
        Lyuben Todorov added a comment -

        The process goes like this:

        1. We enable recording via JMX SS#enableQueryRecording
        2. Insert n number of queries
        3. Replay queries to a new cluster using tools/bin/workloadreplayer

        Running through an example:

        1. JMX call to SS#enableQueryRecording where we supply parameters of 5 for a 5MB log limit, 4 to record one in every 4 queries, and finally /var/lib/cassandra/querylog as the directory for the logs
        2. Insert 100k rows
          This should result in 2 query logs, one of which is 5MB and has been renamed to include a timestamp in its name; the other will be named QueryLog.log. Between the two logs there should be 25k queries.
        3. Replaying the logs is done via the replay tool (workloadreplayer), where we first supply the directory of the query logs and then various flags (see git branch here), e.g.:
          ./tools/bin/workloadreplayer /Users/lyubentodorov/Desktop/Log/ -t 1000000

        Concerns:
        Two synchronized blocks (one in QueryProcessor#maybeLogQuery and the other in QueryRecorder#append) have been added on the read path, but since these blocks will only be hit when query logging is enabled, they shouldn't hinder performance where it matters most.
        I've used the Thrift client, so I'm not sure if query routing will be optimal.

        Feature branch here, also attaching a patch for trunk. I'll patch this for cassandra-2.0 tomorrow.

        Lyuben Todorov made changes -
        Attachment: 6572-trunk.diff [ 12639673 ]
        Lyuben Todorov made changes -
        Status: Open [ 1 ] → Patch Available [ 10002 ]
        Jonathan Ellis added a comment -

        How do you deal w/ prepared vs non-prepared queries? Thinking of CASSANDRA-7021 here.

        Jonathan Ellis added a comment -

        WDYT Tyler Hobbs, is this uninvasive enough to make it into 2.0?

        Jonathan Ellis made changes -
        Reviewer: Tyler Hobbs [ thobbs ]
        Tyler Hobbs added a comment -

        It's a pretty safe patch, but as a non-essential feature I think it should be reserved for 2.1 or 3.0.

        Aleksey Yeschenko added a comment -

        I'd say 3.0, with 2.1 being so close, and delayed.

        Benedict added a comment -

        Lyuben Todorov, a few comments/suggestions about the patch:

        1. In the query recorder, it would make most sense to keep a writer handle open, instead of re-opening the file every time you append a new query. Ideally, we would probably have a buffer of some kind we write to in-memory, which we swap when we have to flush to disk so that other queries can continue to log to the buffer without being impeded by the flush.
        2. It's quite wasteful to convert the query string to base64-encoded bytes, and then to convert that back into a string. Should write the bytes straight into the new buffer.
        3. Since you're using an AtomicInteger, there's no need to use a lock there: can simply increment the counter and check the result (modulo frequency) to see if we should log. No need to reset to zero.
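
        For illustration, a minimal sketch of the lock-free sampling check described in point 3 (class and field names are hypothetical):

        import java.util.concurrent.atomic.AtomicInteger;

        public class QuerySampler
        {
            private final AtomicInteger counter = new AtomicInteger();
            private final int frequency; // log one of every `frequency` queries

            public QuerySampler(int frequency)
            {
                this.frequency = frequency;
            }

            // incrementAndGet is atomic, so no lock is needed; the counter is
            // never reset, we just check the result modulo the frequency.
            public boolean shouldLog()
            {
                return counter.incrementAndGet() % frequency == 0;
            }
        }
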
        Jonathan Ellis added a comment -

        It's a pretty safe patch, but as a non-essential feature I think it should be reserved for 2.1 or 3.0.

        Let's do 2.1.1 then.

        Jonathan Ellis made changes -
        Fix Version/s: 2.0.8 [ 12326712 ] → 2.1 [ 12324159 ]
        Jonathan Ellis made changes -
        Fix Version/s: 2.1.0 [ 12324159 ] → 2.1.1 [ 12326774 ]
        Tyler Hobbs added a comment -

        In addition to Benedict's comments, there are a couple of other things to change:

        • Substring checks won't cut it for isSystemOrTraceKS(). Instead, let the query get parsed first and then check its keyspace. Unfortunately, it doesn't look like the Statement subclasses expose any keyspace-related info right now, so you may need to add that yourself.
        • When replaying, don't read every file upfront. I would make an Iterator that transparently iterates over the files and reads one query string at a time.
        • As Benedict mentioned, you don't want to base64 the query string. I'm guessing you did that to avoid spaces and newlines. Instead, just write the length of the query string out as an integer before writing the query string. When reading, you'll first read the length and then you can just read that many bytes.
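
        A sketch of the length-prefixed framing suggested above, combined with the timestamp header discussed earlier; the exact record layout here is an assumption, not the committed format:

        import java.io.*;
        import java.nio.charset.StandardCharsets;

        final class QueryLogFraming
        {
            // Assumed record layout: 8-byte timestamp, 4-byte length, UTF-8 query bytes.
            static void writeRecord(DataOutputStream out, long timestampMillis, String query) throws IOException
            {
                byte[] bytes = query.getBytes(StandardCharsets.UTF_8); // length in bytes, not chars
                out.writeLong(timestampMillis);
                out.writeInt(bytes.length);
                out.write(bytes);
            }

            // Read the length first, then exactly that many bytes.
            static String readRecord(DataInputStream in) throws IOException
            {
                long timestampMillis = in.readLong(); // a fuller record type would return this too
                byte[] bytes = new byte[in.readInt()];
                in.readFully(bytes);
                return new String(bytes, StandardCharsets.UTF_8);
            }
        }
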
        Benedict added a comment -

        Thanks Tyler Hobbs. There are a few other things to address with this ticket on top of both of our comments here - I'm going to work with Lyuben offline on it once he's back. I don't want to steal your reviewer hat, but you're welcome to look at other things instead.

        Tyler Hobbs added a comment -

        Alright, I've set you as the reviewer. I do have plenty of other things on my plate.

        Tyler Hobbs made changes -
        Reviewer: Tyler Hobbs [ thobbs ] → Benedict [ benedict ]
        Jonathan Ellis added a comment -

        Suggest also storing whatever client or connection id we have; seems like we'd want to preserve thread relationships on playback.

        Lyuben Todorov added a comment -

        Once we switch to keeping a buffer of queries in memory rather than opening and writing to a file for each query, I'm wondering if there should be some timing on the log so it gets flushed to disk after x amount of time. The argument against is that the log gets flushed when the queue is full or query recording is disabled, but if no querying is going on and Cassandra gets shut down, the buffered log will be lost. A simpler way of handling that is to handle that case in the shutdown process, but then for as long as the query log remains enabled it will be consuming memory.

        Benedict added a comment -

        I wouldn't worry about the negligible amount of memory wasted storing the data - after all, if there are no new messages to log it means the server isn't serving any requests. A shutdown hook is probably easiest and sufficient, as a periodic flush would also miss any messages logged between last flush and shutdown.
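
        A minimal sketch of that shutdown hook, assuming a hypothetical QueryRecorder singleton exposing a forceFlush() method:

        // Flush any buffered queries to disk before the JVM exits.
        Runtime.getRuntime().addShutdownHook(
            new Thread(QueryRecorder.instance::forceFlush, "query-log-flush"));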

        Lyuben Todorov added a comment - edited

        Since queries are logged in their string form, this doesn't accommodate prepared statements. One way I think we can log the PSs is to log the query string during the prepare phase along with the query's id, e.g. b7693b50da63a31229b8413754bc72c0 INSERT INTO ks.cf (col1) VALUES ( ? ), and then in ExecuteMessage#execute the values and the id can be logged again later in the log; then during replay we can match values to the queryString by using the id. A better way to do it would be to get access to the statementId in QP#executePrepared, but I'm not sure whether it's worth changing the statement to store its id.
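
        For illustration, a minimal sketch of matching logged values back to their prepared query string by statement id during replay (class and method names are hypothetical):

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        final class PreparedReplayIndex
        {
            // statement id -> query string with bind markers, captured at prepare time
            private final Map<String, String> preparedById = new ConcurrentHashMap<>();

            void onPrepare(String statementId, String queryString)
            {
                preparedById.put(statementId, queryString);
            }

            // At replay, values logged by ExecuteMessage#execute are matched
            // back to their query string via the statement id.
            String queryFor(String statementId)
            {
                return preparedById.get(statementId);
            }
        }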

        Benedict added a comment -

        It might be easiest to modify QP.processPrepared to accept the statementId as an extra parameter, to keep everything encapsulated in QP.

        Lyuben Todorov added a comment -

        I added a commit (5494...ea44) where:

        1. base64 encoding / decoding is removed
        2. the logfile is now a byte[] that gets flushed to disk once full

        Next up is working in prepared statements and logging client threading info.

        Benedict added a comment - edited

        Some comments on the in-progress patch:

        • Don't create a string with the header and convert it to bytes - convert the string to bytes and write a normal byte-encoded header with timestamp + length as longs. This will make encoding the prepared statement parameters much easier also
        • Encapsulate queryQue and logPosition into a single object, and use an AtomicInteger for the position - don't synchronise, just bump the position however much you need, then write to the owned range. On flush, swap the object (use an AtomicReference to track the current buffer)
        • On flush, append directly from the byte buffer, don't copy it. Create a FileOutputStream and call its appropriate write method with the range that is in use
        • On the read path, you're now eagerly reading all files which is likely to blow up the heap; at least create an Iterator that only reads a whole file at once (preferably read a chunk of a file at a time, with a BufferedInputStream)
        • On replay timing we want to target hitting the same delta from epoch for running the query, not the delta from the prior query - this should help prevent massive timing drifts
        • Query frequency can be an int rather than an Integer to avoid unboxing
        • I think it would be nice if we checked the actual CFMetaData for the keyspaces we're modifying in the CQLStatement, rather than doing a find within the whole string, but it's not too big a deal
        • atomicCounterLock needs to be removed
        • As a general rule, never copy array contents with a loop - always use System.arraycopy
        • Still need to log the thread + session id as Jonathan mentioned
        Lyuben Todorov added a comment -

        Pushed changes to 2d9f57d708d67d2c3e191e36daabc2bf35426386. The commit deals with:

        1. Creating the header using a buffer and then writing a Long for the timestamp and an Int for the query length + bytes of the actual query string
        2. Switched query frequency to an int
        3. Moved queryQue (also renamed it to queryQueue) and logPosition into a separate object, also now tracking the queryQueue with an AtomicReference (still need to swap objects when the log is full rather than resetting position and emptying the byte[])
        4. removed atomicCounterLock
        5. changed the flush process to use an OutputStream without looping over the buffer array.
        6. stopped looping over arrays when appending to the query buffer, now using System.arraycopy instead.
        7. changed the workload replay to (still need to read one file at a time, and break files up into segments)
        Benedict added a comment -

        Thanks for the update. A couple of nits:

        • No need to create a DataOutputStream for writing the byte[] - the FOS will be sufficient
        • Convert the query-string to bytes first, and then determine the length, as there could be unicode characters present that occupy more than one byte per character
        • For legibility, it would be nice to create a specific object for storing the loaded queries, instead of a Pair<Long, byte[]> - and you may as well convert it to a string earlier like you were before
        • Don't forget to make the append asynchronous (i.e. not synchronized, by exploiting the position being an AtomicInteger to permit multiple queries getting logged simultaneously). Can look in CommitLogSegment.allocate() for an example of how this can be done.
        Lyuben Todorov added a comment -

        I'm not too sure about the process of making the appends async for the workload recording, so I'd like to get some feedback on my idea before I go any further. It goes like this:

        1. Calculate how much space will be required from the query queue in a separate function
        2. Move the queue pointer ahead by that value
        3. Submit the append in a separate thread, where this is done using an executor that gets shut down when the query logging is disabled or C* is shut down.

        And ofc remove the "synchronized" from QueryRecorder#append

        Benedict added a comment - edited

        Points 1 and 2 need to be atomic, so you calculate how much room you need, then in a loop you get the current value of the queue pointer, check there is room in the queue for the record, and then cas the current value to the new value (old + size). If it succeeds you exit the loop and then you can write to the buffer. No need to offload to another thread. The flush needs to make sure that all prior appends have finished before flushing though, and the best way to do this is with an OpOrder. See CommitLogSegment for examples. The appends would do something like:

        OpOrder.Group opGroup = order.start();
        try
        {
            int size = calcSize();
            QueryQueue q;
            int position;
            while (true)
            {
                q = currentQueue;
                position = q.currentPosition.get();
                // on CAS success we own the range [position, position + size)
                if (position + size < q.buffer.length && q.currentPosition.compareAndSet(position, position + size))
                    break;
            }
            doAppend(q, position);
        }
        finally
        {
            opGroup.close();
        }

        and the flush would do something like

                OpOrder.Barrier barrier = order.newBarrier();
                barrier.issue();
                barrier.await();
                doFlush();
        
        Lyuben Todorov added a comment -

        Updated the branch on 54e579b to remove sync from appends. Due to QueryRecorder's package change the diff isn't very clear, so I'll add this gist as well to make changes more visible. Next up is exposing CFMetaData to CQLStatements, handling prepared statements, logging thread + session id info, and segmenting log reads on replay.

        Benedict added a comment -

        Thanks - some comments on the latest patch:

        1. QR.queryQueue and QQ.limit should both be final
        2. Your allocate(String) method and allocate(int) method need to be merged (part of allocate(int) can be moved into the QQ); allocate(String) needs to know which QQ it has performed allocate(int) against; as it stands there is a race where it allocates against one and writes to another. At the start of the loop you should get the QQ, allocate against it, on failure you attempt to runFlush() and then immediately loop to the start and try again
        3. runFlush is still synchronized; you should cas the QQ from the one you know is full to a new one, and if successful you know you can perform the flush
        4. I would move the flush itself onto a separate thread still - a TPE with 0 core threads and 1 max should suffice
        5. There are some races in your initing/swapping of queues; I suggest making the byte[] final in QQ, allocating it in the constructor, and allocating the first queue by checking if it is null and cas-ing to a new queue; look at SlabAllocator for an example of this. Use a ConcurrentLinkedQueue to store any QQs allocated but failed to cas, and put any finished flushing QQs onto this CLQ when done so that we quickly reach an equilibrium point. We can then make them DirectByteBuffers at some point easily as well.
        6. There's no need to close/reopen your OpOrder once you've made these changes
        Lyuben Todorov added a comment - edited

        put any finished flushing QQs onto this CLQ when done so that we quickly reach an equilibrium point.

        Why do we want to store completed QQs onto the CLQ used for storing failed pre-flush CAS? Also, what would be the best time to poll the CLQ of failed-flush QQs? I'd say on CAS success of the current queue, submit multiple flushes in separate threads until the CLQ of failed QQs is emptied out.

        Other than that, a commit with everything else from the comment is on c38080f.

        Benedict added a comment -

        Why do we want to store completed QQs onto the CLQ used for storing failed pre-flush CAS

        Because they're both empty and we can reuse them (resetting their position to zero) without allocating new QQs. It's not a failedFlushQueue, it's a recycled queue.

        Some comments:

        1. In QQ.allocate, no need to check pos + size > 0; just assert that pos > 0 on insert (pos < 0 would break even with your extra check)
        2. Do not compareAndSet the first queue in an assert; if assertions are disabled it won't happen
        3. No need to define q outside of your loop
        4. Need to compareAndSet() the QQ position to the end of the queue before flushing; should do this after CASing q to a new QQ, so that you have exclusive access for flushing q, so the position you get is definitely the end position.
        5. It's a little neater to have your position >= 0 clause first, so you can simply return, and don't need a large nested if with a continue at the end
        6. Looks like runFlush() is no longer needed
        7. QQ.limit is unnecessary
        8. QQ.logPosition should be final
        Lyuben Todorov added a comment - edited

        Commit 3bd3ddc:

        1. runFlush() is used in SS to force a flush should the query recording be disabled / Cassandra get shut down, but I renamed it to forceFlush for clarity.
        2. added an interface that gives us access to a statement's keyspace from CFMetaData. isSystemOrTraceKS now uses the keyspace from the CFMetaData. Batch statements are special-cased as they contain more than just a single keyspace (might possibly change this further along once we get to replaying of batches). (separate commit here)
        3. Updated QR to recycle QQs which get stored onto the CLQ.
        4. Added a small optimisation: BBs used for creating the data array in QR#allocate(String) are now recycled - the largest buffer created is stored, and when a bigger one is required it gets allocated and replaces the older one.
        5. Added forceFlush to the shutdown process.

        P.S. The more this develops, the more it seems like QR should be a singleton. I think it won't be a big change at all to modify it to be one; not that the current design is a problem, just an idea to consider.

        Benedict added a comment -

        Added a small optimisation: BBs used for creating the data array in QR#allocate(String) are now recycled - the largest buffer created is stored, and when a bigger one is required it gets allocated and replaces the older one.

        Instead of this, it is better to simply write the data directly to the QQ; you can wrap the appropriate range in a ByteBuffer and return the ByteBuffer from the allocate() method (null indicating there wasn't enough room)
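
        A rough sketch of what that allocate() could look like (class and field names assumed from the discussion above, not taken from the patch):

        import java.nio.ByteBuffer;
        import java.util.concurrent.atomic.AtomicInteger;

        final class QueryQueue
        {
            private final byte[] buffer;
            private final AtomicInteger position = new AtomicInteger();

            QueryQueue(int size)
            {
                buffer = new byte[size];
            }

            // Reserve `size` bytes and hand the caller a ByteBuffer over exactly
            // that range; null means the queue is full and must be swapped/flushed.
            ByteBuffer allocate(int size)
            {
                while (true)
                {
                    int pos = position.get();
                    if (pos + size > buffer.length)
                        return null;
                    if (position.compareAndSet(pos, pos + size))
                        return ByteBuffer.wrap(buffer, pos, size);
                }
            }
        }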

        1. Your QQ recycling code is broken. This should only be performed by the flushing thread, once we know the flush has finished, or if we fail to swap the QQ (i.e. if line 85 returns false)
        2. you should use getAndSet() on line 87, the result being the value pos should take
        3. I wouldn't introduce a special interface for AccessibleKeyspace, and I wouldn't consider batch statements to be magically safe. Batch mutations have a set of CFs they maintain of their modifications, you can check these for system keyspaces, and if any are present avoid logging the whole batch. We don't want to stuff up the system keyspaces somehow on replay. I would suggest potentially introducing a boolean method to CQLStatement that returns true if the statement is safe to log; there are a small number of abstract classes that would need implementations (AuthStmt, CFStmt, ModStmt)

        It occurs to me, on top of logging the thread id as Jonathan suggests, we also need to log the client session id as well, else "use" statements could cause absolute chaos. This means we may need to log disconnect of client sessions as well as a special record to permit state cleanup.

        Lyuben Todorov added a comment -

        Pushed updates to 1916883:

        1. QQ now uses a ByteBuffer for storing queries
        2. Added a boolean CQLStatement#isSystemOrTraceKS and removed the AccessibleKeyspace interface.
        3. Reading one file at a time during replay.
        Benedict added a comment -
        • You need to move the getAndSet() on line 76 back inside the if statement
        • recycleQueue shouldn't compareAndSet its position to zero, it should just set it always, and always recycle
        • I'd probably make a static helper method for checking if a keyspace is one we want to avoid tracing (e.g. the sketch below)

        Otherwise this batch of changes LGTM.
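
        For the helper, something along these lines (a sketch; system and system_traces are the keyspaces discussed above):

        // Hypothetical static helper: true for keyspaces whose queries we
        // don't want to record or replay.
        private static boolean isInternalKeyspace(String keyspace)
        {
            return "system".equals(keyspace) || "system_traces".equals(keyspace);
        }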

        Benedict made changes -
        Status: Patch Available [ 10002 ] → Open [ 1 ]
        Lyuben Todorov added a comment -

        Added support for prepared statements on 26e3541. PS replay doesn't fully work yet, as statements with bound variables are being executed as soon as they are read in from the log rather than being placed on the queue for replay, but there's a todo in the code (will fix in next commit).

        Benedict added a comment -

        A few brief comments:

        1. bb.limit() is probably not what you want to be calling; bb.remaining() tells you how many bytes there are to read in the provided BB
        2. you appear to be allocating an empty byte[] and not writing anything to it; however, this is an unnecessary step - you can call variableClone.put(bb.duplicate())
        3. You don't appear to be doing anything with variablesClone? Maybe redundant dead code you forgot to remove?
        4. I would avoid calling duplicate() into varsClone; I would simply pass in the original vars, and call duplicate() when you use them inside of append()
        Lyuben Todorov added a comment -

        variablesClone was used for debugging the values being stored for replay, so it was indeed redundant. Pushing 6228df6, which makes replay of prepared statements functional; now string statements and prepared statements can be mixed and replayed. The commit also addressed the above issues.

        Next up is logging of thread & session ids.

        Lyuben Todorov added a comment -

        Added thread id to the logs in b2c4503. Planning on changing the Iterable<QuerylogSegment> to Map<Integer, Iterable<QuerylogSegment>> so that each entry represents a single thread's collection of querylog segments (with the integer being the thread id as recorded in the logs), and then submitting each iterable as a runnable task to an executor service.

        The one thing I struggled finding is session information: we have access to new connections in o.a.c.transport.Server#ConnectionTracker#addConnection, but beyond that I didn't see where session information could be accessed.

        Lyuben Todorov added a comment -

        Summary of commit 85b83c2:

        1. Updated WR#read to create a multimap of thread ids and their respective QuerylogSegments to allow for recreating the multi-threaded environment which executed the queries.
        2. Added an executor service that runs each of the threads built in WR#read.
        3. Removed static context from WR
        Benedict added a comment - edited

        It looks to me like you need some way to share the statement preparation across threads, as it can be used by any thread (and across log segments) once prepared. Probably easiest to do it during parsing of the log file.

        We also have an issue with replay potentially over-parallelizing, and also potentially OOMing, as you're submitting straight to a thread pool after parsing each file. So there's nothing stopping us racing ahead and reading all of the log files (you have an unbounded queue), but since you submit each file separately you will spawn a thread/executor for each thread/segment combination, rather than each thread.

        Probably we want to create some separate state to represent a thread, which we create once the first time we see a thread id, insert it into a map, and then place work directly onto this queue during parsing of all segments. We can submit a runnable immediately for processing this queue to represent a thread. We have a potential problem here, though, which is that we do not know if a thread died, so we can fill up the executor pool, so we for now let's use an unbounded executorpool and leave tackling this properly until we have everything else in place. We should then limit how many queries ahead we can read to prevent OOM.

        Also, we're still replaying based on offset from the last query, which means we will skew very quickly. We should be fixing an epoch (in nanos) such that you have a log epoch of L, and queries are run at T=L+X; when re-run we have a replay epoch of R, and we run queries at R+X.
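
        A minimal sketch of that epoch-based scheduling (the class and method names are illustrative):

        import java.util.concurrent.locks.LockSupport;

        final class ReplayClock
        {
            private final long logEpochNanos;    // L: timestamp of the first logged query
            private final long replayEpochNanos; // R: fixed once, when replay starts

            ReplayClock(long logEpochNanos)
            {
                this.logEpochNanos = logEpochNanos;
                this.replayEpochNanos = System.nanoTime();
            }

            // A query logged at L + X is replayed at R + X, so timing errors
            // on individual queries do not accumulate into drift.
            void awaitTurn(long loggedAtNanos)
            {
                long sleep = replayEpochNanos + (loggedAtNanos - logEpochNanos) - System.nanoTime();
                if (sleep > 0)
                    LockSupport.parkNanos(sleep);
            }
        }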

        Lyuben Todorov added a comment -

        It looks to me like you need some way to share the statement preparation across threads, as it can be used by any thread (and across log segments) once prepared. Probably easiest to do it during parsing of the log file

        Seems simple enough: a concurrent map shared across a WorkloadReplayer should do the job. The problem with doing it while parsing the log is that the statement might be for a ks/cf that hasn't been created yet.

        We also have an issue with replay potentially over-parallelizing, and also potentially OOMing, as you're submitting straight to a thread pool after parsing each file. So there's nothing stopping us racing ahead and reading all of the log files (you have an unbounded queue)

        A possible solution is to move the multimap to the class level rather than having WR#read create one each time it's called (again, per WorkloadReplayer, which is fine since we should only have one per replay). Then every time a read completes we submit the collection of QuerylogSegments to be replayed, empty the map, and repopulate it if the same thread-id is met again in WR#read. The tricky part is submitting a thread-id only once we know the executor doesn't already have a running task with the same thread-id.

        Also, we're still replaying based on offset from last query, which means we will skew very quickly. We should be fixing an epoch (in nanos) such that you have a log epoch of L, and queries are run at T=L+X; when re-run we have a replay epoch of R, and we run queries at R+X

        It's on the todo list.
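
        A minimal sketch of the shared statement cache idea, assuming a hypothetical PreparedStatement type and prepare() hook; preparing lazily at first execution, rather than at parse time, sidesteps the ks/cf-not-yet-created problem above, since any earlier schema statements in the log will have been replayed by then:

            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentMap;

            // One instance per WorkloadReplayer, shared by all replay threads.
            class StatementCache
            {
                private final ConcurrentMap<String, PreparedStatement> cache = new ConcurrentHashMap<>();

                // Prepare at first execution rather than at parse time, so the
                // keyspace/table is guaranteed to exist by the time we prepare.
                PreparedStatement get(String cql)
                {
                    return cache.computeIfAbsent(cql, this::prepare);
                }

                private PreparedStatement prepare(String cql)
                {
                    // Delegate to QueryProcessor in the real code; stubbed here.
                    throw new UnsupportedOperationException("sketch only");
                }

                interface PreparedStatement { }
            }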

        Benedict made changes -
        Assignee Lyuben Todorov [ lyubent ]
        Jonathan Ellis made changes -
        Assignee Carl Yeksigian [ carlyeks ]
        Jonathan Ellis added a comment -

        Have you had a chance to look at this, Carl?

        Jonathan Ellis made changes -
        Reviewer Benedict [ benedict ]
        Carl Yeksigian added a comment -

        I have a rebase of the branch here, but haven't yet gotten around to fixing the issues pointed out.

        Overall, though, I'm not sure what we're trying to accomplish with these additions that can't already be accomplished with the new stress tool, especially since this will be sampling and so might not even be running when the event you wanted to replay took place.

        If we're using this to replicate workloads for replay, the new stress tool can be configured to do that better. With sampling, the workload wouldn't be the same as it was when we captured it, so we won't be able to replicate it exactly.

        If there is a specific query that is failing, reproducing it and pulling together a workload replay will probably involve steps similar to putting together a stress profile.

        One other advantage of a stress profile over a workload replay is that it would be more open for people to share, since it wouldn't include any sensitive information.

        I can continue moving this forward, but I'm skeptical that the benefit is enough to justify the additional risk versus the stress tool we already have.

        Aleksey Yeschenko added a comment -

        I agree with Carl's points here.

        We also might want to create a downgradesstables tool, to make trying out new versions even easier (by making reverting back easier).

        Jonathan Ellis added a comment -

        We also might want to create a downgradesstables tool, to make trying out new versions even easier (by making reverting back easier).

        I'm even more excited about using this to get realistic workloads to benchmark than I am about helping people upgrade with confidence. So downgradesstables is a nice idea, but I don't think it obsoletes this.

        If we are using this to replicate workloads for replaying a workload, then the new stress tool can be better configured to do that.

        Okay, so maybe we shift our goals a bit. Assuming we can tune newstress to do 50% inserts to table X, 10% reads from table Y, and 40% updates to table Z, doing that by hand is a pain. What if we make this tool gather statistics like that from a live cluster, then write out a stress schema + config to emulate it?
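
        A rough sketch of that gathering step, assuming hypothetical per-operation counters; the emitted ops(...) string mimics the new stress tool's user-mode mix syntax, which is an assumption here rather than verified output:

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.atomic.LongAdder;

            // Tallies live traffic by (operation, table) and emits a cassandra-stress-style
            // mix. The bucket names and the ops(...) format are illustrative assumptions.
            class WorkloadStats
            {
                private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

                // e.g. record("insert", "X") on each write to table X
                void record(String op, String table)
                {
                    counts.computeIfAbsent(op + '-' + table, k -> new LongAdder()).increment();
                }

                // e.g. "ops(insert-X=50,read-Y=10,update-Z=40)"
                String toStressOps()
                {
                    long total = counts.values().stream().mapToLong(LongAdder::sum).sum();
                    if (total == 0)
                        return "ops()";
                    StringBuilder sb = new StringBuilder("ops(");
                    String sep = "";
                    for (Map.Entry<String, LongAdder> e : counts.entrySet())
                    {
                        sb.append(sep).append(e.getKey()).append('=')
                          .append(Math.round(100.0 * e.getValue().sum() / total));
                        sep = ",";
                    }
                    return sb.append(')').toString();
                }
            }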

        Jonathan Shook added a comment -

        I believe we do need a tool to help characterize real workloads. There is a distinct difference between a suggestive combination of tests and something based on the access patterns, timings, data distribution, etc. of a real working system. While the new stress tool is nifty, it is a stress tool, not a workload characterization tool. There are many systems in use whose access patterns are not trivially easy to test with a service mix. Also, understanding the behavior of clients and the cluster in situ can be very helpful in various ways: troubleshooting, capacity planning, scale testing, etc. Lots of time is spent just getting users to the point where they understand what their system is doing. A proper sample of real behavior is the best possible data to start from.

        I'm in favor of keeping some form of real workload sampling on the table. Even if the stress tool is enhanced, there is still a workload characterization problem to solve here, which I believe is true to the spirit of the original request.

        Sylvain Lebresne made changes -
        Fix Version/s 2.1.2 [ 12328841 ]
        Fix Version/s 2.1.1 [ 12326774 ]
        T Jake Luciani made changes -
        Fix Version/s 2.1.3 [ 12328951 ]
        Fix Version/s 2.1.2 [ 12328841 ]
        Jeremiah Jordan made changes -
        Fix Version/s 3.0 [ 12324945 ]
        Fix Version/s 2.1.3 [ 12328951 ]
        Jonathan Ellis added a comment -

        Created CASSANDRA-8929 for followup on sampling.

        Jonathan Ellis made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Carl Yeksigian [ carlyeks ]
        Fix Version/s 3.0 [ 12324945 ]
        Resolution Won't Fix [ 2 ]
        Transition                 Time In Source Status   Execution Times   Last Executer    Last Execution Date
        Open → Patch Available     89d 23h 45m             1                 Lyuben Todorov   10/Apr/14 23:32
        Patch Available → Open     97d 8h 58m              1                 Benedict         17/Jul/14 08:30
        Open → Resolved            232d 8h 27m             1                 Jonathan Ellis   06/Mar/15 15:58

          People

          • Assignee: Unassigned
          • Reporter: Jonathan Ellis
          • Votes: 2
          • Watchers: 19
