Cassandra
CASSANDRA-4261

[patch] Support consistency-latency prediction in nodetool

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 2
    • Component/s: Tools
    • Labels: None

      Description

      Introduction

      Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

      This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra's behavior in their production environments without interfering with performance.

      What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch provides answers via nodetool predictconsistency:

      nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions

      Example output
      
      //N == ReplicationFactor
      //R == read ConsistencyLevel
      //W == write ConsistencyLevel
      
      user@test:$ nodetool predictconsistency 3 100 1
      Performing consistency prediction
      100ms after a given write, with maximum version staleness of k=1
      N=3, R=1, W=1
      Probability of consistent reads: 0.678900
      Average read latency: 5.377900ms (99.900th %ile 40ms)
      Average write latency: 36.971298ms (99.900th %ile 294ms)
      
      N=3, R=1, W=2
      Probability of consistent reads: 0.791600
      Average read latency: 5.372500ms (99.900th %ile 39ms)
      Average write latency: 303.630890ms (99.900th %ile 357ms)
      
      N=3, R=1, W=3
      Probability of consistent reads: 1.000000
      Average read latency: 5.426600ms (99.900th %ile 42ms)
      Average write latency: 1382.650879ms (99.900th %ile 629ms)
      
      N=3, R=2, W=1
      Probability of consistent reads: 0.915800
      Average read latency: 11.091000ms (99.900th %ile 348ms)
      Average write latency: 42.663101ms (99.900th %ile 284ms)
      
      N=3, R=2, W=2
      Probability of consistent reads: 1.000000
      Average read latency: 10.606800ms (99.900th %ile 263ms)
      Average write latency: 310.117615ms (99.900th %ile 335ms)
      
      N=3, R=3, W=1
      Probability of consistent reads: 1.000000
      Average read latency: 52.657501ms (99.900th %ile 565ms)
      Average write latency: 39.949799ms (99.900th %ile 237ms)
      

      Demo

      Here's an example scenario you can run using ccm. The prediction is fast:

      cd <cassandra-source-dir with patch applied>
      ant
      
      ccm create consistencytest --cassandra-dir=. 
      ccm populate -n 5
      ccm start
      
      # if start fails, you might need to initialize more loopback interfaces
      # e.g., sudo ifconfig lo0 alias 127.0.0.2
      
      # use stress to get some sample latency data
      tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
      tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read
      
      bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
      

      What and Why

      We've implemented Probabilistically Bounded Staleness, a new technique for predicting consistency-latency trade-offs within Cassandra. Our paper will appear in VLDB 2012, and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than profiling and manually testing each possible replication configuration (especially in production!).

      This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevel). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

      We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

      Interface

      This patch allows users to perform this prediction in production using nodetool.

      Users enable tracing of latency data by calling enableConsistencyPredictionLogging() on the PBSPredictorMBean.

      Cassandra logs a configurable number of latencies (set via JMX with setMaxLoggedLatenciesForConsistencyPrediction(int maxLogged); default: 10000). Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is 32*logged_latencies bytes of memory for the predicting node.

      nodetool predictconsistency predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running a configurable number of Monte Carlo trials per configuration (set via setNumberTrialsForConsistencyPrediction(int numTrials); default: 10000).

      Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.
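
      For illustration, here's a minimal JMX client sketch that enables prediction logging and adjusts the two parameters above. The MBean ObjectName and the JMX service URL are assumptions (the port matches the ccm demo above); the operation names are the ones exposed by PBSPredictorMBean. The same steps can be performed interactively with jconsole.

      // Hypothetical sketch (not part of the patch): enable consistency
      // prediction logging on a node over JMX. The ObjectName below is an
      // assumption; check the MBean tree (e.g., in jconsole) for the real one.
      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class EnablePbsTracing
      {
          public static void main(String[] args) throws Exception
          {
              // JMX port 7100 matches node1 in the ccm demo above.
              JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://127.0.0.1:7100/jmxrmi");
              try (JMXConnector connector = JMXConnectorFactory.connect(url))
              {
                  MBeanServerConnection mbs = connector.getMBeanServerConnection();
                  // Assumed MBean name for the PBSPredictorMBean described above.
                  ObjectName name =
                      new ObjectName("org.apache.cassandra.service:type=PBSPredictor");

                  mbs.invoke(name, "enableConsistencyPredictionLogging", null, null);
                  mbs.invoke(name, "setMaxLoggedLatenciesForConsistencyPrediction",
                             new Object[]{ 10000 }, new String[]{ "int" });
                  mbs.invoke(name, "setNumberTrialsForConsistencyPrediction",
                             new Object[]{ 10000 }, new String[]{ "int" });
              }
          }
      }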

      Implementation

      This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

      Latency Data

      We log latency data in service.PBSPredictor, recording four relevant distributions:

      • W: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
      • A: time from when a replica accepting a mutation sends an acknowledgment to the time when the coordinator receives it
      • R: time from when the coordinator sends a read request to the time that the replica performs the read
      • S: time from when the replica sends a read response to the time when the coordinator receives it

      We augment net.MessageIn and net.MessageOut to store timestamps along with every message (8 bytes of overhead for a millisecond long). In net.MessagingService, we log the start of every mutation and read, and, in net.ResponseVerbHandler, we log the end of every mutation and read. Jonathan Ellis mentioned that CASSANDRA-1123 had similar latency tracing, but, as far as we can tell, those latencies aren't in that patch. We use an LRU policy to bound the number of latencies we track for each distribution.
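
      As a rough illustration of the bounding behavior (a hypothetical sketch, not the patch's actual data structure), each distribution can be kept in a fixed-capacity buffer that evicts the oldest sample once maxLogged entries are held:

      import java.util.ArrayDeque;
      import java.util.Deque;

      // Hypothetical sketch of a bounded per-distribution latency log.
      public class BoundedLatencyLog
      {
          private final int maxLogged;
          private final Deque<Long> latencies = new ArrayDeque<>();

          public BoundedLatencyLog(int maxLogged)
          {
              this.maxLogged = maxLogged;
          }

          public synchronized void record(long latencyMs)
          {
              if (latencies.size() == maxLogged)
                  latencies.removeFirst();   // evict the oldest sample
              latencies.addLast(latencyMs);  // append the newest
          }

          public synchronized long[] snapshot()
          {
              return latencies.stream().mapToLong(Long::longValue).toArray();
          }
      }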

      Prediction

      When prompted by nodetool, we call service.PBSPredictor.doPrediction, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.
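
      For readers who want the intuition without reading the patch, here's a simplified sketch of what a single trial looks like for the k=1 case (hypothetical code, not the actual PBSPredictor implementation): sample per-replica W/A/R/S latencies, determine when the write commits at ConsistencyLevel W, issue the read t ms later, and check whether any of the R fastest-responding replicas had already received the write when it served the read. The fraction of trials that return true over numTrials estimates the probability of consistent reads.

      import java.util.Arrays;
      import java.util.List;
      import java.util.Random;

      // Hypothetical, simplified single Monte Carlo trial for k=1
      // (not the patch's actual code). wLat/aLat/rLat/sLat are the collected
      // latency samples, in milliseconds, for the four distributions above.
      public class PbsTrialSketch
      {
          private final Random rng = new Random();

          private long sample(List<Long> lat)
          {
              return lat.get(rng.nextInt(lat.size()));
          }

          public boolean consistentTrial(int n, int r, int w, long tAfterWrite,
                                         List<Long> wLat, List<Long> aLat,
                                         List<Long> rLat, List<Long> sLat)
          {
              long[] writeArrives = new long[n];  // write reaches replica i
              long[] ackArrives = new long[n];    // ack from replica i reaches coordinator
              for (int i = 0; i < n; i++)
              {
                  writeArrives[i] = sample(wLat);
                  ackArrives[i] = writeArrives[i] + sample(aLat);
              }

              // The write returns to the client once W acks have arrived.
              long[] sortedAcks = ackArrives.clone();
              Arrays.sort(sortedAcks);
              long writeCommit = sortedAcks[w - 1];

              // The read starts tAfterWrite ms later; replicas serve it and respond.
              long readStart = writeCommit + tAfterWrite;
              long[] readServed = new long[n];     // replica i serves the read
              long[] respArrives = new long[n];    // response from replica i arrives
              for (int i = 0; i < n; i++)
              {
                  readServed[i] = readStart + sample(rLat);
                  respArrives[i] = readServed[i] + sample(sLat);
              }

              // The read returns after the R fastest responses.
              Integer[] byResponse = new Integer[n];
              for (int i = 0; i < n; i++)
                  byResponse[i] = i;
              Arrays.sort(byResponse, (a, b) -> Long.compare(respArrives[a], respArrives[b]));

              // Consistent if any of those R replicas already had the write
              // at the moment it served the read.
              for (int j = 0; j < r; j++)
              {
                  int replica = byResponse[j];
                  if (writeArrives[replica] <= readServed[replica])
                      return true;
              }
              return false;
          }
      }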

      Testing

      We've modified the unit test for SerializationsTest and provided a new unit test for PBSPredictor (PBSPredictorTest). You can run the PBSPredictor test with ant pbs-test.

      Overhead

      This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing net.MessageIn and net.MessageOut serialization at runtime, which is messy.

      If enabled, consistency tracing requires 32*logged_latencies bytes of memory on the node on which tracing is enabled.
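
      For example, with the default of 10,000 logged latencies, that is 4 distributions * 8 bytes * 10,000 = 320,000 bytes, or roughly 320 KB, on the tracing node.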

      Caveats

      The predictions are conservative, or worst-case, meaning we may predict more staleness than actually occurs, in the following ways:

      • We do not account for read repair.
      • We do not account for Merkle tree exchange.
      • Multi-version staleness is particularly conservative.

      The predictions are optimistic in the following ways:

      • We do not predict the impact of node failure.
      • We do not model hinted handoff.

      We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key. (See discussion below.)

      Predictions are only as good as the collected latencies. Generally, the more latencies collected, the better, but if the environment or workload changes, previously collected latencies may no longer reflect current behavior and prediction accuracy will suffer. Also, we currently don't distinguish between column families or value sizes. This is doable, but it adds complexity to the interface and possibly more storage overhead.

      Finally, for accurate results, we require replicas to have synchronized clocks (Cassandra requires this from clients anyway). If clocks are skewed/out of sync, this will bias predictions by the magnitude of the skew.

      We can potentially improve these if there's interest, but this is an area of active research.


      Peter Bailis and Shivaram Venkataraman
      pbailis@cs.berkeley.edu
      shivaram@cs.berkeley.edu

      Attachments

      1. pbs-nodetool-v3.patch (50 kB, Shivaram Venkataraman)
      2. demo-pbs-v3.sh (1 kB, Shivaram Venkataraman)
      3. 4261-v6.txt (44 kB, Shivaram Venkataraman)
      4. 4261-v5.txt (45 kB, Jonathan Ellis)
      5. 4261-v4.txt (44 kB, Jonathan Ellis)

        Peter Bailis made changes -
        Description

        .h1 Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        .h1 Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        .h1 What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        This patch allows users to perform this prediction in production using {{nodetool}}. Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}. Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies (each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node) and then predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration. Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        .h1 Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        .h2 Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        .h2 Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        .h2 Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        .h2 Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        .h1 Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.

        h1. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h1. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h1. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        This patch allows users to perform this prediction in production using {{nodetool}}. Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}. Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies (each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node) and then predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration. Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h1. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h2. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h2. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h2. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h2. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h2. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Description
        h1. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h1. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h1. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        This patch allows users to perform this prediction in production using {{nodetool}}. Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}. Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies (each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node) and then predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration. Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h1. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h2. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h2. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h2. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h2. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h2. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h2. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h2. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h2. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        This patch allows users to perform this prediction in production using {{nodetool}}. Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}. Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies (each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node) and then predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration. Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h2. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h3. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h3. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h3. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h3. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h2. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Comment [ Last commit to Cassandra fork for this patch is at https://github.com/pbailis/cassandra-pbs/commit/6e0ac68b43a7e6692423abf760edf88d633dd04d ]
        Peter Bailis made changes -
        Description h2. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h2. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h2. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        This patch allows users to perform this prediction in production using {{nodetool}}. Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}. Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies (each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node) and then predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration. Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h2. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h3. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h3. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h3. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h3. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h2. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}//

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        This patch allows users to perform this prediction in production using {{nodetool}}. Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}. Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies (each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node) and then predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration. Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.
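To make the analysis concrete, here is a simplified, self-contained sketch of a single trial under the W/A/R/S model described above. It is illustrative only: the class and method names are ours, the "sampling" is a plain resample from the logged arrays, and the real logic lives in {{service.PBSPredictor.doPrediction}}.

{code:title=Sketch of one PBS-style Monte Carlo trial (illustrative, not the patch's code)|borderStyle=solid}
import java.util.Arrays;
import java.util.Random;

// Simplified sketch of a single consistency trial under the W/A/R/S model.
public class PbsTrialSketch
{
    private static final Random RNG = new Random();

    // Stand-in for sampling from one of the logged empirical distributions.
    private static long sample(long[] loggedLatencies)
    {
        return loggedLatencies[RNG.nextInt(loggedLatencies.length)];
    }

    /**
     * @param n replication factor
     * @param w write ConsistencyLevel (acknowledgments required)
     * @param r read ConsistencyLevel (responses required)
     * @param t milliseconds after the write commits at which the read starts
     * @return true if the read observes the written value in this trial
     */
    public static boolean consistentTrial(int n, int w, int r, long t,
                                          long[] wLat, long[] aLat, long[] rLat, long[] sLat)
    {
        long[] writeArrives = new long[n]; // when replica i starts serving the new value
        long[] ackArrives = new long[n];   // when the coordinator hears back from replica i
        for (int i = 0; i < n; i++)
        {
            writeArrives[i] = sample(wLat);
            ackArrives[i] = writeArrives[i] + sample(aLat);
        }

        // The write commits once w acknowledgments have arrived.
        long[] sortedAcks = ackArrives.clone();
        Arrays.sort(sortedAcks);
        long commitTime = sortedAcks[w - 1];

        // Issue the read t ms after commit; a replica's response reflects the write
        // only if the mutation reached that replica before it served the read.
        final long[] responseTime = new long[n];
        final boolean[] sawWrite = new boolean[n];
        for (int i = 0; i < n; i++)
        {
            long readArrives = commitTime + t + sample(rLat);
            responseTime[i] = readArrives + sample(sLat);
            sawWrite[i] = writeArrives[i] <= readArrives;
        }

        // The coordinator returns after the r fastest responses; the read is
        // consistent if any of those responses reflects the write.
        Integer[] byResponseTime = new Integer[n];
        for (int i = 0; i < n; i++)
            byResponseTime[i] = i;
        Arrays.sort(byResponseTime, (x, y) -> Long.compare(responseTime[x], responseTime[y]));
        for (int k = 0; k < r; k++)
            if (sawWrite[byResponseTime[k]])
                return true;
        return false;
    }
}
{code}

Averaging the outcome of {{number_trials_for_consistency_prediction}} such trials gives the reported probability of consistent reads; the per-trial commit and response times similarly yield the write- and read-latency estimates.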

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.
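For example (illustrative arithmetic, assuming 10,000 logged latencies per distribution rather than any particular default): 32 * 10,000 = 320,000 bytes, i.e., roughly 313 KB of memory on the predicting node.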

        h3. Caveats

 The predictions are conservative, or worst-case, meaning we may predict more staleness than occurs in practice, in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}//

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        This patch allows users to perform this prediction in production using {{nodetool}}. Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}. Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies (each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node) and then predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration. Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}//

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}//

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}\\

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}\\

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}\\\\

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}\\\\

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an acknowledgement to the time when the coordinator receives it
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
         * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the logged latency data. The analysis is straightforward and commented fairly thoroughly, but we can elaborate here if required.
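
        To make the trial structure concrete, here's a minimal, self-contained sketch of a single Monte Carlo trial over the *W*/*A*/*R*/*S* distributions described above. It's illustrative only: the class and method names are ours, and it samples an exponential distribution as a stand-in for the empirical latencies that {{service.PBSPredictor}} actually logs.

        {code:title=Illustrative Monte Carlo trial sketch|borderStyle=solid}
        import java.util.Arrays;
        import java.util.Random;

        public class PbsTrialSketch
        {
            private static final Random RNG = new Random();

            // Stand-in for sampling from one of the four logged latency distributions (ms).
            private static double sample(double meanMillis)
            {
                return -meanMillis * Math.log(1.0 - RNG.nextDouble());
            }

            /**
             * One trial: does a read issued tAfterWrite ms after the write commits
             * observe the written value?  n = ReplicationFactor, w/r = write/read
             * ConsistencyLevel.
             */
            static boolean consistentTrial(int n, int w, int r, double tAfterWrite)
            {
                double[] writeArrival = new double[n]; // W: coordinator -> replica (mutation)
                double[] ackAt = new double[n];        // W + A: when the coordinator sees each ack
                double[] readArrival = new double[n];  // R: coordinator -> replica (read)
                double[] readReturn = new double[n];   // S: replica -> coordinator (response)

                for (int i = 0; i < n; i++)
                {
                    writeArrival[i] = sample(5);
                    ackAt[i] = writeArrival[i] + sample(5);
                    readArrival[i] = sample(5);
                    readReturn[i] = sample(5);
                }

                // The write completes once the w-th fastest acknowledgement arrives.
                double[] sortedAcks = ackAt.clone();
                Arrays.sort(sortedAcks);
                double commitTime = sortedAcks[w - 1];

                // The read starts tAfterWrite ms later; the coordinator waits for the
                // r fastest responses, ordered by round-trip time R + S.
                double readStart = commitTime + tAfterWrite;
                Integer[] order = new Integer[n];
                for (int i = 0; i < n; i++)
                    order[i] = i;
                Arrays.sort(order, (a, b) -> Double.compare(readArrival[a] + readReturn[a],
                                                            readArrival[b] + readReturn[b]));

                // Consistent if any of the r replicas we hear back from had already
                // received the mutation when the read request reached it.
                for (int k = 0; k < r; k++)
                {
                    int i = order[k];
                    if (writeArrival[i] <= readStart + readArrival[i])
                        return true;
                }
                return false;
            }

            public static void main(String[] args)
            {
                int trials = 10000, consistent = 0;
                for (int t = 0; t < trials; t++)
                    if (consistentTrial(3, 1, 1, 100))
                        consistent++;
                System.out.printf("Estimated probability of consistent reads: %f%n",
                                  (double) consistent / trials);
            }
        }
        {code}

        Repeating such trials per (R, W) configuration with the empirical distributions in place of the exponential stand-in yields the kind of consistency probabilities shown in the example output above.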

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative (worst-case), meaning we may predict more staleness than occurs in practice, in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\\\

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\\\

        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? nodetool predictconsistency provides this:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch, exposed by {{nodetool predictconsistency}} provides answers:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper||http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions, then perform a Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an acknowledgment to the time when the coordinator receives that acknowledgment
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
         * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store a timestamp along with every message (8 bytes of overhead for a millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.
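        For intuition only, here is a minimal Java sketch of the kind of bounded per-distribution log this bookkeeping implies. The class and field names are invented for illustration and are not the data structures in {{service.PBSPredictor}}; one such log per distribution (W, A, R, S), at 8 bytes per sample, is what produces the {{32*logged_latencies}} memory figure above.

        {code:borderStyle=solid}
        // Hypothetical sketch: a fixed-capacity ring buffer of latency samples,
        // counted at 8 bytes per sample as in the estimate above.
        // One instance per distribution (W, A, R, S).
        final class BoundedLatencyLog
        {
            private final long[] samplesMs;   // capacity: cf. max_logged_latencies_for_consistency_prediction
            private int next = 0;
            private int count = 0;

            BoundedLatencyLog(int maxSamples)
            {
                samplesMs = new long[maxSamples];
            }

            // record one latency in milliseconds, overwriting the oldest sample once full
            synchronized void add(long latencyMs)
            {
                samplesMs[next] = latencyMs;
                next = (next + 1) % samplesMs.length;
                if (count < samplesMs.length)
                    count++;
            }

            synchronized int size()
            {
                return count;
            }
        }
        {code}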

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the collected data. The analysis is straightforward and fairly well commented, but we can elaborate further here if required.
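        To make the shape of one trial concrete, here is a small, self-contained Java sketch of the kind of Monte Carlo trial described above, using the W/A/R/S distributions from the Latency Data section. It is illustrative only, not the code in {{service.PBSPredictor.doPrediction}}: the names are invented, it fixes k=1, it samples the four distributions independently, and it assumes a non-local coordinator (see Caveats below).

        {code:borderStyle=solid}
        import java.util.Arrays;
        import java.util.List;
        import java.util.Random;

        public class PbsTrialSketch
        {
            static final Random rng = new Random();

            // draw one sample (in ms) uniformly from an empirical latency distribution
            static double sample(List<Double> dist)
            {
                return dist.get(rng.nextInt(dist.size()));
            }

            // One trial: does a read at level r, issued tAfterWriteMs after a write at
            // level w completes, observe the write on at least one replica it waits for?
            static boolean consistentTrial(List<Double> W, List<Double> A,
                                           List<Double> R, List<Double> S,
                                           int n, int w, int r, double tAfterWriteMs)
            {
                double[] writeArrive = new double[n];  // when replica i starts serving the write
                double[] ackReturn = new double[n];    // when replica i's ack reaches the coordinator
                for (int i = 0; i < n; i++)
                {
                    writeArrive[i] = sample(W);
                    ackReturn[i] = writeArrive[i] + sample(A);
                }
                double[] acks = ackReturn.clone();
                Arrays.sort(acks);
                double writeCompletes = acks[w - 1];   // the write returns once w acks arrive

                double readStart = writeCompletes + tAfterWriteMs;
                double[] respondAt = new double[n];    // when replica i's read response arrives
                boolean[] sawWrite = new boolean[n];   // had replica i applied the write when it read?
                for (int i = 0; i < n; i++)
                {
                    double readArrive = readStart + sample(R);
                    respondAt[i] = readArrive + sample(S);
                    sawWrite[i] = writeArrive[i] <= readArrive;
                }

                // the read returns once the r fastest responses arrive; it is consistent
                // if any replica that responded by then had already applied the write
                double[] sortedResp = respondAt.clone();
                Arrays.sort(sortedResp);
                double readCompletes = sortedResp[r - 1];
                for (int i = 0; i < n; i++)
                    if (respondAt[i] <= readCompletes && sawWrite[i])
                        return true;
                return false;
            }

            public static void main(String[] args)
            {
                // toy distributions purely for demonstration
                List<Double> toy = Arrays.asList(1.0, 2.0, 5.0, 10.0, 20.0);
                int trials = 10000, consistent = 0;
                for (int t = 0; t < trials; t++)
                    if (consistentTrial(toy, toy, toy, toy, 3, 1, 1, 100.0))
                        consistent++;
                System.out.printf("p(consistent) ~ %.4f%n", consistent / (double) trials);
            }
        }
        {code}

        Averaging the boolean outcome over many such trials gives a "probability of consistent reads" figure like those in the example output above; the sampled ack and response times likewise yield the predicted write and read latencies.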

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.
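        As a worked example (the sample count here is illustrative, not a shipped default): logging 10,000 latencies per distribution costs 10,000 * 32 bytes = 320,000 bytes, i.e. roughly 320 KB, on the tracing node.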

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than occurs in practice, in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can talk about how to improve these if you're interested. This is an area of active research.
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can potentially improve these if there's interest, but this is an area of active research.
        ----
        Peter Bailis and Shivaram Venkataraman
        [pbailis@cs.berkeley.edu|mailto:pbailis@cs.berkeley.edu]
        [shivaram@cs.berkeley.edu|mailto:shivaram@cs.berkeley.edu]
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch, exposed by {{nodetool predictconsistency}} provides answers:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}s). Should they use CL={{ONE}}? CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}}, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can potentially improve these if there's interest, but this is an area of active research.
        ----
        Peter Bailis and Shivaram Venkataraman
        [pbailis@cs.berkeley.edu|mailto:pbailis@cs.berkeley.edu]
        [shivaram@cs.berkeley.edu|mailto:shivaram@cs.berkeley.edu]
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?

        This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch, exposed by {{nodetool predictconsistency}} provides answers:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}s). Should they use CL={{ONE}}? CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}}, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

         The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We can potentially improve these if there's interest, but this is an area of active research.
        ----
        Peter Bailis and Shivaram Venkataraman
        [pbailis@cs.berkeley.edu|mailto:pbailis@cs.berkeley.edu]
        [shivaram@cs.berkeley.edu|mailto:shivaram@cs.berkeley.edu]
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?

        This patch provides a latency-consistency analysis within {{nodetool}}. Users can accurately predict Cassandra's behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch, exposed by {{nodetool predictconsistency}} provides answers:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}s). Should they use CL={{ONE}}? CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}}, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

        The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        Predictions are only as good as the collected latencies. Generally, the more latencies that are collected, the better, but if the environment or workload changes, things might change. Also, we currently don't distinguish between column families or value sizes. This is doable, but it adds complexity to the interface and possibly more storage overhead.

        We can potentially improve these if there's interest, but this is an area of active research.
        ----
        Peter Bailis and Shivaram Venkataraman
        [pbailis@cs.berkeley.edu|mailto:pbailis@cs.berkeley.edu]
        [shivaram@cs.berkeley.edu|mailto:shivaram@cs.berkeley.edu]
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?

        This patch provides a latency-consistency analysis within {{nodetool}}. Users can accurately predict Cassandra's behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch, exposed by {{nodetool predictconsistency}} provides answers:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}s). Should they use CL={{ONE}}? CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}}, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

        The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        Predictions are only as good as the collected latencies. Generally, the more latencies that are collected, the better, but if the environment or workload changes, things might change. Also, we currently don't distinguish between column families or value sizes. This is doable, but it adds complexity to the interface and possibly more storage overhead.

        We can potentially improve these if there's interest, but this is an area of active research.
        ----
        Peter Bailis and Shivaram Venkataraman
        [pbailis@cs.berkeley.edu|mailto:pbailis@cs.berkeley.edu]
        [shivaram@cs.berkeley.edu|mailto:shivaram@cs.berkeley.edu]
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?

        This patch provides a latency-consistency analysis within {{nodetool}}. Users can accurately predict Cassandra's behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch provides answers via {{nodetool predictconsistency}}:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}s). Should they use CL={{ONE}}? CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}}, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.

        h3. Caveats

         The predictions are conservative, or worst-case, meaning we may predict more staleness than in practice in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

        The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        Predictions are only as good as the collected latencies. Generally, the more latencies that are collected, the better, but if the environment or workload changes, things might change. Also, we currently don't distinguish between column families or value sizes. This is doable, but it adds complexity to the interface and possibly more storage overhead.

        We can potentially improve these if there's interest, but this is an area of active research.
        ----
        Peter Bailis and Shivaram Venkataraman
        [pbailis@cs.berkeley.edu|mailto:pbailis@cs.berkeley.edu]
        [shivaram@cs.berkeley.edu|mailto:shivaram@cs.berkeley.edu]
        Peter Bailis made changes -
        Description h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?

        This patch provides a latency-consistency analysis within {{nodetool}}. Users can accurately predict Cassandra's behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch provides answers via {{nodetool predictconsistency}}:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.811700
        Average read latency: 6.896300ms (99.900th %ile 174ms)
        Average write latency: 8.788000ms (99.900th %ile 252ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.867200
        Average read latency: 6.818200ms (99.900th %ile 152ms)
        Average write latency: 33.226101ms (99.900th %ile 420ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 6.766800ms (99.900th %ile 111ms)
        Average write latency: 153.764999ms (99.900th %ile 969ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.951500
        Average read latency: 18.065800ms (99.900th %ile 414ms)
        Average write latency: 8.322600ms (99.900th %ile 232ms)

        N=3, R=2, W=2
        Probability of consistent reads: 0.983000
        Average read latency: 18.009001ms (99.900th %ile 387ms)
        Average write latency: 35.797100ms (99.900th %ile 478ms)

        N=3, R=3, W=1
        Probability of consistent reads: 0.993900
        Average read latency: 101.959702ms (99.900th %ile 1094ms)
        Average write latency: 8.518600ms (99.900th %ile 236ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than trying out different configurations (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}s). Should they use CL={{ONE}}? CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}}, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

        Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

        Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}} Monte Carlo trials per configuration.

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
         * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
        * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

        h4. Prediction

        When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.
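
To make the analysis concrete, here is a minimal, self-contained sketch of the kind of trial we run for the t-visibility (k=1) case, assuming the WARS model from the PBS paper. The class, method names, and hard-coded sample latencies below are hypothetical illustrations, not the patch's actual {{PBSPredictor}} code, which resamples from the logged distributions instead:

{code:borderStyle=solid}
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of a PBS-style Monte Carlo estimate for one (N, R, W) configuration. */
public class PbsMonteCarloSketch
{
    private final Random rng = new Random(42);

    // Stand-in latency samples (ms) for the four WARS distributions; the real
    // predictor draws from the empirically logged latencies instead.
    private final double[] wSamples = { 2, 5, 9, 20, 45 };
    private final double[] aSamples = { 1, 2, 4, 8, 15 };
    private final double[] rSamples = { 1, 3, 6, 12, 30 };
    private final double[] sSamples = { 1, 2, 5, 10, 25 };

    private double sample(double[] dist)
    {
        return dist[rng.nextInt(dist.length)];
    }

    /** Probability that a read at CL=r observes a write at CL=w, t ms after the write returns. */
    public double consistentReadProbability(int n, int r, int w, double t, int trials)
    {
        int consistent = 0;
        for (int trial = 0; trial < trials; trial++)
        {
            // 1. Simulate the write: each replica applies the mutation after a W-sampled delay,
            //    and its ack reaches the coordinator after an additional A-sampled delay.
            double[] applied = new double[n];
            double[] acked = new double[n];
            for (int i = 0; i < n; i++)
            {
                applied[i] = sample(wSamples);
                acked[i] = applied[i] + sample(aSamples);
            }
            double[] ackOrder = acked.clone();
            Arrays.sort(ackOrder);
            double commitTime = ackOrder[w - 1];   // write returns once w acks arrive
            double readStart = commitTime + t;     // read issued t ms later

            // 2. Simulate the read: the coordinator waits for the r fastest responses.
            double[] responseAt = new double[n];
            boolean[] fresh = new boolean[n];
            Integer[] byResponse = new Integer[n];
            for (int i = 0; i < n; i++)
            {
                double readArrives = readStart + sample(rSamples);
                responseAt[i] = readArrives + sample(sSamples);
                fresh[i] = applied[i] <= readArrives; // replica had already applied the write
                byResponse[i] = i;
            }
            Arrays.sort(byResponse, (x, y) -> Double.compare(responseAt[x], responseAt[y]));

            // 3. The read is consistent if any of the r fastest responders returned the new value.
            boolean sawWrite = false;
            for (int j = 0; j < r; j++)
                sawWrite |= fresh[byResponse[j]];
            if (sawWrite)
                consistent++;
        }
        return consistent / (double) trials;
    }

    public static void main(String[] args)
    {
        PbsMonteCarloSketch sketch = new PbsMonteCarloSketch();
        System.out.printf("N=3, R=1, W=1, t=100ms: %.4f%n",
                          sketch.consistentReadProbability(3, 1, 1, 100, 10000));
    }
}
{code}

The same trial structure yields latency predictions as well: {{ackOrder[w - 1]}} is the simulated write latency for that trial, and the r-th fastest {{responseAt}} value, measured from {{readStart}}, is the simulated read latency.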

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.
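For example, if {{max_logged_latencies_for_consistency_prediction}} were set to 10,000 per distribution (an illustrative value, not necessarily the default), tracing would use 32 × 10,000 = 320,000 bytes, or roughly 320 KB, on that node.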

        h3. Caveats

 The predictions are conservative (worst-case), meaning we may predict more staleness than occurs in practice, for the following reasons:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.
         * We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key.

        The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        Predictions are only as good as the collected latencies. Generally, the more latencies that are collected, the better, but if the environment or workload changes, the predictions may no longer reflect actual behavior. Also, we currently don't distinguish between column families or value sizes. This is doable, but it adds complexity to the interface and possibly more storage overhead.

        We can potentially improve these if there's interest, but this is an area of active research.
        ----
        Peter Bailis and Shivaram Venkataraman
        [pbailis@cs.berkeley.edu|mailto:pbailis@cs.berkeley.edu]
        [shivaram@cs.berkeley.edu|mailto:shivaram@cs.berkeley.edu]
        Peter Bailis made changes -
        Attachment demo-pbs.sh [ 12528697 ]
        Peter Bailis added a comment -

I've provided a bash script that performs a full end-to-end demonstration of this patch, in case you didn't want to pull a clean source tree and patch it, then copy and paste the commands above. The script clones Cassandra trunk, applies the patch, then spins up and profiles a local 5-node cluster using ccm as above. The script isn't robust, but it should be easy enough to debug. Enjoy!

        Peter Bailis added a comment -

Update to the patch. Fixed a bug so that if two reads happen to have the same latency, we still treat them separately. This required a two-line change that effectively excludes reads we've already considered within a given trial. Also added a check for this case in the test.

        Peter Bailis made changes -
        Attachment pbs-nodetool-v2.patch [ 12531065 ]
        Peter Bailis made changes -
        Attachment pbs-nodetool-v1.patch [ 12528209 ]
        Peter Bailis added a comment -

        Updated hyperlink in demo script.

        Peter Bailis made changes -
        Attachment demo-pbs-v2.sh [ 12531067 ]
        Peter Bailis added a comment -

        Fixing sample output so it's correct.

        Peter Bailis made changes -
        Description
        Peter Bailis made changes -
        Attachment demo-pbs.sh [ 12528697 ]
        Jonathan Ellis added a comment -

        Off the top of my head, it looks like performance impact should be negligible. Does that match your test results?

If so, I'd be fine with dropping the extra config settings entirely. (log_latencies_for_consistency_prediction only makes sense if overhead is meaningful, and the others only if users are expert enough in PBS to adjust them intelligently, which doesn't sound realistic to me.)

        Peter Bailis added a comment -

        re: performance, we haven't noticed anything, but we also haven't done much serious load testing. I agree that there shouldn't be much overhead, and the only thing I can think of possibly being a problem would be contention in the ConcurrentHashMap that maps requestIDs to lists of latencies. However, this really shouldn't be a problem. To quantify this, I can run and report numbers for something like stress on an EC2 cluster. Would that work? Are there existing performance regression tests? If you have a preference for a different workload or configuration, let me know.

re: the other config file settings, max_logged_latencies_for_consistency_prediction is possibly useful. Because we use an LRU policy for the latency logging, the number of latencies logged indirectly determines the window of time for sampling. If you want to capture a longer trace of network behavior, you'd increase the window, and if you want to do some on-the-fly tuning, you might shorten it. However, we could easily make this a runtime configuration via nodetool instead.
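
        To make the window intuition concrete (a rough, back-of-the-envelope illustration, assuming a node logs on the order of 1,000 operations per second): the default cap of 10,000 latencies then covers only about the last 10 seconds of traffic, while raising the cap to 100,000 stretches the sampling window to roughly 100 seconds at a cost of about 32*100000 = 3.2 MB of memory.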

        Jonathan Ellis added a comment -

        stress on ec2 would be a reasonable smoke test.

        would definitely prefer something tunable via JMX, even if not exposed to nodetool; cassandra.yaml changes require a node restart to take effect.

Peter Bailis added a comment - edited

        I agree that JMX would work better. I'll work on changing this configuration and will post performance numbers shortly. I should be able to have this done in a week or so (latency due to my schedule, not due to task difficulty).

        Shivaram Venkataraman added a comment -

        We've posted an updated patch:

        1.) PBSPredictor is now tunable via JMX. Latency collection can be enabled/disabled by calling enableConsistencyPredictionLogging()/disableConsistencyPredictionLogging() respectively. Also the number of latencies collected can be tuned by calling setNumberTrialsForConsistencyPrediction.

        2.) We performed benchmarking and optimized our logging to minimize overhead.

We ran load tests by sending 1M queries using cassandra-stress on EC2 with 4 m1.large instances (ephemeral storage formatted as XFS) and a replication factor of three. We've posted more details and the scripts used for benchmarking at https://github.com/shivaram/cassandra-pbs-bench

        We compared three setups:
        1.) "trunk": without this patch
        2.) "no-pbs": patch applied but consistency prediction logging disabled
        3.) "pbs": patch applied and logging enabled

        We tested with ConsistencyLevel=ONE (R1W1) and ConsistencyLevel=QUORUM (R2W2). The average latency (ms) and standard deviation across five trials are below:

R1W1 - Insert (setup, avg latency ms, std dev):

        trunk 10.31 0.090
        no-pbs 10.58 0.092
        pbs 11.21 0.107

        R1W1 - Read (setup, avg latency ms, std dev):

        trunk 9.11 0.067
        no-pbs 9.13 0.044
        pbs 9.27 0.015

        R2W2 - Insert (setup, avg latency ms, std dev):

        trunk 12.36 0.028
        no-pbs 12.44 0.072
        pbs 13.21 0.068

        R2W2 - Read (setup, avg latency ms, std dev):

        trunk 12.41 0.136
        no-pbs 12.56 0.054
        pbs 12.79 0.099

The latency overhead for inserts is around 0.9ms when PBS is turned on, for both R1W1 and R2W2. We believe this is primarily due to the overhead of calling System.currentTimeMillis() at the start and finish of each message, and also due to the overhead of 50 stress threads inserting latency information into the ConcurrentHashMap.

The overhead is around 0.2ms per query when PBS logging is turned off (max 1.65% overhead). This is because, even though the logging is turned off, the creation time of each message is still serialized in MessageIn.java and MessageOut.java. We can optimize this by adding an extra flag to the wire protocol and sending the timestamp only when that flag (also configurable via JMX) is set, if you prefer.
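
        As a purely hypothetical sketch of that optional-timestamp idea (the class, flag, and layout below are invented for illustration and are not part of the patch):

        {code:title=Sketch of flag-guarded timestamp serialization (illustrative)|borderStyle=solid}
        import java.io.DataOutputStream;
        import java.io.IOException;

        // Illustrative only: send the 8-byte creation time only when a (JMX-togglable)
        // flag is set, and write a one-byte marker so the receiver knows whether to read it.
        public class OptionalTimestampHeader
        {
            public static volatile boolean sendCreationTime = false; // assumed JMX-controlled flag

            public static void serialize(DataOutputStream out, long creationTimeMillis) throws IOException
            {
                out.writeBoolean(sendCreationTime);
                if (sendCreationTime)
                    out.writeLong(creationTimeMillis); // 8 bytes, only when tracing is wanted
            }
        }
        {code}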

        Finally, the overhead for reads is lower than that for writes because reads are only sent to the nearest nodes and sending local messages avoids the PBS latency collection code path.

        Shivaram Venkataraman made changes -
        Attachment pbs-nodetool-v3.patch [ 12535875 ]
        Shivaram Venkataraman made changes -
        Attachment demo-pbs-v3.sh [ 12535877 ]
        Peter Bailis made changes -
        Attachment demo-pbs-v2.sh [ 12531067 ]
        Peter Bailis made changes -
        Attachment pbs-nodetool-v2.patch [ 12531065 ]
        Jonathan Ellis added a comment -

        even though the logging is turned off, the creation time of each message is serialized in MessageIn.java and MessageOut.java

        This will also be useful for CASSANDRA-2858 and CASSANDRA-1123 so it's fine to leave it always-on.

        Jonathan Ellis added a comment -

        sending local messages avoids the PBS latency collection code path

        Is that a bug or a feature? Shouldn't local latency count as well?

        Shivaram Venkataraman added a comment -

        Local latency counts, but whether this is a bug or a feature depends on how often we are local. If we have a small cluster (say ReplicationFactor=3 with 3 nodes), it's possible that many requests can be fulfilled locally. If we have a large cluster (say ReplicationFactor=3 with 100 nodes), a much smaller number of requests will be fulfilled locally; each replica that is contacted will, on average, have to make a larger number of remote requests.

        We expected that remote operations are more common for larger clusters where consistency is a problem. One of our caveats is that we only simulate non-local reads and writes and assume that the coordinating Cassandra node is not a replica.

        Without knowing the client-to-replica request routing, it's difficult to make better predictions. As an alternative, we could use a heuristic (say, 30% are local operations occurring randomly on different nodes) or even try to profile this.

        Peter Bailis made changes -
        Description
        h3. Introduction

        Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?

        This patch provides a latency-consistency analysis within {{nodetool}}. Users can accurately predict Cassandra's behavior in their production environments without interfering with performance.

        What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch provides answers via {{nodetool predictconsistency}}:

        {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
        \\ \\
        {code:title=Example output|borderStyle=solid}

        //N == ReplicationFactor
        //R == read ConsistencyLevel
        //W == write ConsistencyLevel

        user@test:$ nodetool predictconsistency 3 100 1
        Performing consistency prediction
        100ms after a given write, with maximum version staleness of k=1
        N=3, R=1, W=1
        Probability of consistent reads: 0.678900
        Average read latency: 5.377900ms (99.900th %ile 40ms)
        Average write latency: 36.971298ms (99.900th %ile 294ms)

        N=3, R=1, W=2
        Probability of consistent reads: 0.791600
        Average read latency: 5.372500ms (99.900th %ile 39ms)
        Average write latency: 303.630890ms (99.900th %ile 357ms)

        N=3, R=1, W=3
        Probability of consistent reads: 1.000000
        Average read latency: 5.426600ms (99.900th %ile 42ms)
        Average write latency: 1382.650879ms (99.900th %ile 629ms)

        N=3, R=2, W=1
        Probability of consistent reads: 0.915800
        Average read latency: 11.091000ms (99.900th %ile 348ms)
        Average write latency: 42.663101ms (99.900th %ile 284ms)

        N=3, R=2, W=2
        Probability of consistent reads: 1.000000
        Average read latency: 10.606800ms (99.900th %ile 263ms)
        Average write latency: 310.117615ms (99.900th %ile 335ms)

        N=3, R=3, W=1
        Probability of consistent reads: 1.000000
        Average read latency: 52.657501ms (99.900th %ile 565ms)
        Average write latency: 39.949799ms (99.900th %ile 237ms)
        {code}

        h3. Demo

        Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

        {code:borderStyle=solid}
        cd <cassandra-source-dir with patch applied>
        ant

        # turn on consistency logging
        sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

        ccm create consistencytest --cassandra-dir=.
        ccm populate -n 5
        ccm start

        # if start fails, you might need to initialize more loopback interfaces
        # e.g., sudo ifconfig lo0 alias 127.0.0.2

        # use stress to get some sample latency data
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
        tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

        bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
        {code}

        h3. What and Why

        We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than profiling and manually testing each possible replication configuration (especially in production!).

        This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}). Should they use CL={{ONE}}? CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}}, meaning users don't know how stale their data will be.)

        We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

        h3. Interface

        This patch allows users to perform this prediction in production using {{nodetool}}.

Users enable tracing of latency data by calling {{enableConsistencyPredictionLogging()}} on the {{PBSPredictorMBean}}.

        Cassandra logs a configurable number of latencies (set via JMX with {{setMaxLoggedLatenciesForConsistencyPrediction(int maxLogged)}}; default: 10000). Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

        {{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running a configurable number of Monte Carlo trials per configuration (set via JMX with {{setNumberTrialsForConsistencyPrediction(int numTrials)}}; default: 10000).

        Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.
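
        For example, these knobs could be driven from a standalone JMX client along the following lines (a sketch only: the MBean {{ObjectName}} below is an assumption, the JMX port matches the ccm demo above, and the operation names are the ones listed in this ticket):

        {code:title=Toggling prediction settings over JMX (illustrative)|borderStyle=solid}
        import javax.management.MBeanServerConnection;
        import javax.management.ObjectName;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        public class PbsJmxExample
        {
            public static void main(String[] args) throws Exception
            {
                // Connect to the first ccm node's JMX port (7100 in the demo above).
                JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7100/jmxrmi");
                JMXConnector connector = JMXConnectorFactory.connect(url);
                try
                {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    // Assumed ObjectName; check the name the patch actually registers.
                    ObjectName pbs = new ObjectName("org.apache.cassandra.service:type=PBSPredictor");

                    // Start collecting latencies and bump the number of Monte Carlo trials.
                    mbs.invoke(pbs, "enableConsistencyPredictionLogging", null, null);
                    mbs.invoke(pbs, "setNumberTrialsForConsistencyPrediction",
                               new Object[]{ 20000 }, new String[]{ "int" });
                }
                finally
                {
                    connector.close();
                }
            }
        }
        {code}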

        h3. Implementation

        This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions then perform Monte Carlo analysis using the gathered data.

        h4. Latency Data

        We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
 * *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
         * *A*: time from when a replica accepting a mutation sends an acknowledgment to the time that the coordinator receives it
         * *R*: time from when the coordinator sends a read request to the time that the replica performs the read
         * *S*: time from when the replica sends a read response to the time when the coordinator receives it

        We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch. We use an LRU policy to bound the number of latencies we track for each distribution.
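
        As a rough illustration of the LRU bound (a self-contained sketch under our own names, not the patch's data structure):

        {code:title=Bounded latency log sketch (illustrative)|borderStyle=solid}
        import java.util.ArrayDeque;
        import java.util.Deque;

        // Keeps at most maxLogged samples for one distribution; the oldest sample is
        // evicted whenever a new one arrives at capacity, so the log tracks recent traffic.
        public class BoundedLatencyLog
        {
            private final Deque<Long> latencies = new ArrayDeque<Long>();
            private final int maxLogged;

            public BoundedLatencyLog(int maxLogged)
            {
                this.maxLogged = maxLogged;
            }

            public synchronized void record(long latencyMillis)
            {
                if (latencies.size() == maxLogged)
                    latencies.removeFirst(); // drop the oldest sample
                latencies.addLast(latencyMillis);
            }
        }
        {code}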

        h4. Prediction

        When prompted by {{nodetool}}, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.
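
        For readers who want the shape of a single trial without opening the patch, here is a hedged, self-contained sketch of one WARS-style trial for k=1 (class and method names are invented; the real logic lives in {{service.PBSPredictor.doPrediction}} and samples from the logged W/A/R/S distributions):

        {code:title=One Monte Carlo trial, sketched (illustrative)|borderStyle=solid}
        import java.util.Arrays;
        import java.util.List;
        import java.util.Random;

        public class PbsTrialSketch
        {
            private static final Random RNG = new Random();

            // The patch samples from the empirically logged latency lists; here we just
            // pick a random element from whatever list the caller supplies.
            private static double sample(List<Double> dist)
            {
                return dist.get(RNG.nextInt(dist.size()));
            }

            /**
             * One simulated trial: does a read issued tAfterWrite ms after the write
             * commits (i.e., after W acks return) observe the written value?
             */
            public static boolean consistentTrial(int n, int r, int w, double tAfterWrite,
                                                  List<Double> wDist, List<Double> aDist,
                                                  List<Double> rDist, List<Double> sDist)
            {
                double[] writeArrives = new double[n]; // when replica i starts serving the new value
                double[] ackReturns = new double[n];   // when the coordinator hears replica i's ack
                for (int i = 0; i < n; i++)
                {
                    writeArrives[i] = sample(wDist);
                    ackReturns[i] = writeArrives[i] + sample(aDist);
                }

                double[] sortedAcks = ackReturns.clone();
                Arrays.sort(sortedAcks);
                double commitTime = sortedAcks[w - 1]; // write returns once W acks are in

                double readStart = commitTime + tAfterWrite;
                double[] responses = new double[n];
                boolean[] hasNewValue = new boolean[n];
                for (int i = 0; i < n; i++)
                {
                    double readArrives = readStart + sample(rDist);
                    responses[i] = readArrives + sample(sDist);
                    // Replica i serves the new value iff the write reached it before the read did.
                    hasNewValue[i] = writeArrives[i] <= readArrives;
                }

                // The coordinator returns after the R fastest responses; the read is
                // consistent if at least one of those responses carries the new value.
                double[] sortedResponses = responses.clone();
                Arrays.sort(sortedResponses);
                double cutoff = sortedResponses[r - 1];
                for (int i = 0; i < n; i++)
                    if (responses[i] <= cutoff && hasNewValue[i])
                        return true;
                return false;
            }
        }
        {code}

        Running many such trials and averaging the boolean outcome gives the "Probability of consistent reads" line in the output above; the same sampled times yield the read and write latency averages and percentiles.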

        h4. Testing

        We've modified the unit test for {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

        h4. Overhead

        This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

        If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node on which tracing is enabled.
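
        As a worked example, at the default of 10,000 logged latencies per distribution, that is 32 bytes * 10,000 = 320,000 bytes, i.e. roughly 313 KB of heap on the tracing node.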

        h3. Caveats

The predictions are conservative (worst-case), meaning we may predict more staleness than occurs in practice, in the following ways:
         * We do not account for read repair.
         * We do not account for Merkle tree exchange.
         * Multi-version staleness is particularly conservative.

        The predictions are optimistic in the following ways:
         * We do not predict the impact of node failure.
         * We do not model hinted handoff.

        We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key. (See discussion below.)

        Predictions are only as good as the collected latencies. Generally, the more latencies collected, the better, but if the environment or workload changes, previously logged latencies may no longer reflect current behavior and predictions will drift until fresh data is gathered. Also, we currently don't distinguish between column families or value sizes. This is doable, but it adds complexity to the interface and possibly more storage overhead.

        Finally, for accurate results, we require replicas to have synchronized clocks (Cassandra requires this from clients anyway). If clocks are skewed/out of sync, this will bias predictions by the magnitude of the skew.

        We can potentially improve these if there's interest, but this is an area of active research.
        ----
        Peter Bailis and Shivaram Venkataraman
        [pbailis@cs.berkeley.edu|mailto:pbailis@cs.berkeley.edu]
        [shivaram@cs.berkeley.edu|mailto:shivaram@cs.berkeley.edu]
        Peter Bailis added a comment -

        Is there anything else you'd like to have us do for the patch?

        Jonathan Ellis added a comment -

        Sorry, hoping to be able to leverage parts of CASSANDRA-1123 which has taken longer than anticipated. Should be in soon though. In retrospect, should have gone ahead with doing this one first, my apologies.

        Matt Blair added a comment -

        So now that CASSANDRA-1123 has been resolved, will this get merged in time for 1.2?

        Jonathan Ellis added a comment -

        v4 attached, mostly rebased to trunk.

        "Mostly" means that I'm not sure what to do with the message timestamps. CASSANDRA-2858 added the timestamp to QueuedMessage / MessageDeliveryTask instead of MessageOut/MessageIn. In the case of MessageOut/QM, QM is definitely the right place since MO construction is relatively expensive so we avoid any per-replica information there. Less clear for MessageIn/MDT: if we leave it in MDT we either need to add the timestamp to all IVerbHandler for this one special case, or do some hackish contortions to pass it just to ResponseVerbHandler (and thence to PBSPredictor). OTH putting it in MessageIn breaks the MI/MO symmetry which is confusing.

        Thoughts?

        Also: is there any way we can leverage the metrics from CASSANDRA-4009 instead of storing a "second copy" of certain metrics in PBS?

        Jonathan Ellis made changes -
        Attachment 4261-v4.txt [ 12545442 ]
        Peter Bailis added a comment -

        Jonathan,

        Thanks for the rebase! Looking at the updated code, we can still log the start of the operation in MessagingService.sendRR() but move the reply timestamp logging from the ResponseVerbHandler to MessagingService.receive(). This won't be too bad, and we can filter the MessageIn instances passed to PBSPredictor by both the verb type and/or by id. Does that make sense?

        Also, re: CASSANDRA-4009, it should be possible to use this code, but there are two issues:
        1.) We need finer-granularity tracing than what is currently implemented. We need to know how long it takes to hit a given node and not just the end-to-end round-trip latencies.
        2.) Using a histogram instead of keeping around the actual latencies will reduce the fidelity of the predictions. The impact of this depends on the bucket size and distribution.

        Let us know what you think!

        Jonathan Ellis added a comment -

        v5 attached as you suggested.

        NB, intellij reports that trialALatencies and trialSLatencies are never read from in PBSP.doPrediction.

        Jonathan Ellis made changes -
        Attachment 4261-v5.txt [ 12546774 ]
        Jonathan Ellis made changes -
        Assignee Peter Bailis [ pbailis ]
        Fix Version/s 1.2.0 beta 2 [ 12323284 ]
        Affects Version/s 1.2.0 beta 1 [ 12319262 ]
        Priority Major [ 3 ] Minor [ 4 ]
        Reviewer jbellis
        Shivaram Venkataraman added a comment -

        Remove trialALatencies and trialSLatencies as they weren't required. v6 attached

        Shivaram Venkataraman made changes -
        Attachment 4261-v6.txt [ 12546887 ]
        Jonathan Ellis added a comment -

        committed!

        Jonathan Ellis made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]

          People

            • Assignee: Peter Bailis
            • Reporter: Peter Bailis
            • Reviewer: Jonathan Ellis