CASSANDRA-4261

[patch] Support consistency-latency prediction in nodetool

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Low
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 2
    • Component/s: Tool/nodetool
    • Labels: None

    Description

      Introduction

      Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

      This patch provides a latency-consistency analysis within nodetool. Users can accurately predict Cassandra's behavior in their production environments without interfering with performance.

      What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch provides answers via nodetool predictconsistency:

      nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions

      Example output
      //N == ReplicationFactor
      //R == read ConsistencyLevel
      //W == write ConsistencyLevel
      
      user@test:$ nodetool predictconsistency 3 100 1
      Performing consistency prediction
      100ms after a given write, with maximum version staleness of k=1
      N=3, R=1, W=1
      Probability of consistent reads: 0.678900
      Average read latency: 5.377900ms (99.900th %ile 40ms)
      Average write latency: 36.971298ms (99.900th %ile 294ms)
      
      N=3, R=1, W=2
      Probability of consistent reads: 0.791600
      Average read latency: 5.372500ms (99.900th %ile 39ms)
      Average write latency: 303.630890ms (99.900th %ile 357ms)
      
      N=3, R=1, W=3
      Probability of consistent reads: 1.000000
      Average read latency: 5.426600ms (99.900th %ile 42ms)
      Average write latency: 1382.650879ms (99.900th %ile 629ms)
      
      N=3, R=2, W=1
      Probability of consistent reads: 0.915800
      Average read latency: 11.091000ms (99.900th %ile 348ms)
      Average write latency: 42.663101ms (99.900th %ile 284ms)
      
      N=3, R=2, W=2
      Probability of consistent reads: 1.000000
      Average read latency: 10.606800ms (99.900th %ile 263ms)
      Average write latency: 310.117615ms (99.900th %ile 335ms)
      
      N=3, R=3, W=1
      Probability of consistent reads: 1.000000
      Average read latency: 52.657501ms (99.900th %ile 565ms)
      Average write latency: 39.949799ms (99.900th %ile 237ms)
      

      Demo

      Here's an example scenario you can run using ccm. The prediction is fast:

      cd <cassandra-source-dir with patch applied>
      ant
      
      ccm create consistencytest --cassandra-dir=. 
      ccm populate -n 5
      ccm start
      
      # if start fails, you might need to initialize more loopback interfaces
      # e.g., sudo ifconfig lo0 alias 127.0.0.2
      
      # use stress to get some sample latency data
      tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
      tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read
      
      bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
      

      What and Why

      We've implemented Probabilistically Bounded Staleness, a new technique for predicting consistency-latency trade-offs within Cassandra. Our paper will appear in VLDB 2012, and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than profiling and manually testing each possible replication configuration (especially in production!).

      This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevel). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't know how stale their data will be.)

      We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

      Interface

      This patch allows users to perform this prediction in production using nodetool.

      Users enable tracing of latency data by calling enableConsistencyPredictionLogging() in the PBSPredictorMBean.

      Cassandra logs a variable number of latencies, controllable via JMX (setMaxLoggedLatenciesForConsistencyPrediction(int maxLogged), default: 10000). Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is 32*logged_latencies bytes of memory on the predicting node.

      nodetool predictconsistency predicts the latency and consistency for each possible ConsistencyLevel setting (reads and writes) by running a configurable number of Monte Carlo trials per configuration (setNumberTrialsForConsistencyPrediction(int numTrials), default: 10000).

      Users shouldn't have to touch these parameters, and the defaults work well. The more latencies they log, the better the predictions will be.
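
      For reference, here is a minimal sketch of invoking these operations from a standalone JMX client instead of nodetool. The operation names are the ones described above; the ObjectName and port are assumptions that may need adjusting for your deployment (the ccm demo above exposes JMX on port 7100, a standalone node defaults to 7199).

      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class EnablePrediction
      {
          public static void main(String[] args) throws Exception
          {
              // Connect to the node's JMX port (7199 by default; adjust for ccm).
              JMXServiceURL url =
                  new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
              try (JMXConnector connector = JMXConnectorFactory.connect(url))
              {
                  MBeanServerConnection mbs = connector.getMBeanServerConnection();

                  // NOTE: this ObjectName is an assumption for illustration; check the
                  // name the PBSPredictor MBean registers under in the applied patch.
                  ObjectName name = new ObjectName("org.apache.cassandra.service:type=PBSPredictor");

                  // Turn on latency tracing and (optionally) adjust the defaults described above.
                  mbs.invoke(name, "enableConsistencyPredictionLogging", new Object[0], new String[0]);
                  mbs.invoke(name, "setMaxLoggedLatenciesForConsistencyPrediction",
                             new Object[]{ 10000 }, new String[]{ "int" });
                  mbs.invoke(name, "setNumberTrialsForConsistencyPrediction",
                             new Object[]{ 10000 }, new String[]{ "int" });
              }
          }
      }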

      Implementation

      This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions, then perform a Monte Carlo analysis using the gathered data.

      Latency Data

      We log latency data in service.PBSPredictor, recording four relevant distributions:

      • W: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
      • A: time from when a replica accepting a mutation sends an acknowledgment to the time when the coordinator receives it
      • R: time from when the coordinator sends a read request to the time that the replica performs the read
      • S: time from when the replica sends a read response to the time when the coordinator receives it

      We augment net.MessageIn and net.MessageOut to store timestamps along with every message (8 bytes of overhead for a millisecond-resolution long). In net.MessagingService, we log the start of every mutation and read, and, in net.ResponseVerbHandler, we log the end of every mutation and read. Jonathan Ellis mentioned that CASSANDRA-1123 had similar latency tracing, but, as far as we can tell, those latencies aren't in that patch. We use an LRU policy to bound the number of latencies we track for each distribution.
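
      To make the bookkeeping concrete, here is a rough, self-contained sketch of logging one of these distributions with a bounded buffer. It is not the patch's service.PBSPredictor code, and the class and method names are made up for the example; it just illustrates recording a send timestamp per message, computing the latency when the response arrives, and evicting old entries once the configured bound is reached.

      import java.util.ArrayDeque;
      import java.util.Collections;
      import java.util.Deque;
      import java.util.LinkedHashMap;
      import java.util.Map;

      // Illustrative sketch only; the real bookkeeping lives in service.PBSPredictor.
      public class LatencyLog
      {
          private final int maxLogged;                              // cf. setMaxLoggedLatenciesForConsistencyPrediction
          private final Map<String, Long> startTimes;               // message id -> send timestamp (ms)
          private final Deque<Long> latencies = new ArrayDeque<>(); // samples for one of the W/A/R/S distributions

          public LatencyLog(final int maxLogged)
          {
              this.maxLogged = maxLogged;
              // Bound the outstanding send timestamps with an LRU policy.
              this.startTimes = Collections.synchronizedMap(
                  new LinkedHashMap<String, Long>(16, 0.75f, true)
                  {
                      protected boolean removeEldestEntry(Map.Entry<String, Long> eldest)
                      {
                          return size() > maxLogged;
                      }
                  });
          }

          // Called when the coordinator sends a message (cf. net.MessagingService).
          public void logStart(String messageId)
          {
              startTimes.put(messageId, System.currentTimeMillis());
          }

          // Called when the response arrives (cf. net.ResponseVerbHandler); records one sample.
          public synchronized void logEnd(String messageId)
          {
              Long start = startTimes.remove(messageId);
              if (start == null)
                  return;
              latencies.addLast(System.currentTimeMillis() - start);
              if (latencies.size() > maxLogged)
                  latencies.removeFirst(); // drop the oldest sample once the bound is hit
          }
      }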

      Prediction

      When prompted by nodetool, we call service.PBSPredictor.doPrediction, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well but can elaborate more here if required.
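
      For readers who want the gist without opening the patch, the following is a simplified, standalone sketch of the kind of trial the analysis runs, assuming the logged W/A/R/S samples are available as lists. The class and method names are illustrative, not the patch's, and details such as the sampling scheme are assumptions.

      import java.util.Arrays;
      import java.util.List;
      import java.util.Random;

      // Simplified illustration of a Monte Carlo analysis over logged W/A/R/S latencies.
      public class PBSMonteCarlo
      {
          private final Random rng = new Random();

          // Sample uniformly from a logged empirical distribution.
          private long sample(List<Long> dist)
          {
              return dist.get(rng.nextInt(dist.size()));
          }

          /**
           * Estimate the probability that a read issued t ms after a write completes
           * returns the written value, for replication factor n, read CL r, write CL w.
           */
          public double consistencyProbability(int n, int r, int w, long t, int trials,
                                               List<Long> W, List<Long> A,
                                               List<Long> R, List<Long> S)
          {
              int consistent = 0;
              for (int trial = 0; trial < trials; trial++)
              {
                  long[] writeArrive = new long[n]; // when each replica applies the mutation
                  long[] ackArrive = new long[n];   // when the coordinator hears each ack
                  for (int i = 0; i < n; i++)
                  {
                      writeArrive[i] = sample(W);
                      ackArrive[i] = writeArrive[i] + sample(A);
                  }

                  // The write completes once the w-th fastest acknowledgment arrives.
                  long[] sortedAcks = ackArrive.clone();
                  Arrays.sort(sortedAcks);
                  long commitTime = sortedAcks[w - 1];

                  // The read starts t ms after the write completes; each replica serves
                  // whatever it has when the read request reaches it.
                  long readStart = commitTime + t;
                  final long[] responseArrive = new long[n];
                  boolean[] fresh = new boolean[n];
                  for (int i = 0; i < n; i++)
                  {
                      long servedAt = readStart + sample(R);
                      fresh[i] = writeArrive[i] <= servedAt; // replica already has the write?
                      responseArrive[i] = servedAt + sample(S);
                  }

                  // The coordinator waits for the r fastest responses; the read is
                  // consistent if any of them reflects the write.
                  Integer[] order = new Integer[n];
                  for (int i = 0; i < n; i++)
                      order[i] = i;
                  Arrays.sort(order, (a, b) -> Long.compare(responseArrive[a], responseArrive[b]));
                  for (int j = 0; j < r; j++)
                  {
                      if (fresh[order[j]])
                      {
                          consistent++;
                          break;
                      }
                  }
              }
              return (double) consistent / trials;
          }
      }

      The patch runs the configured number of trials for each (R, W) combination, which is how the per-configuration probabilities and latency estimates in the example output are produced.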

      Testing

      We've modified the unit test for SerializationsTest and provided a new unit test for PBSPredictor (PBSPredictorTest). You can run the PBSPredictor test with ant pbs-test.

      Overhead

      This patch introduces 8 bytes of overhead per message. We could reduce this overhead and add timestamps on-demand, but this would require changing net.MessageIn and net.MessageOut serialization at runtime, which is messy.

      If enabled, consistency tracing requires 32*logged_latencies bytes of memory on the node on which tracing is enabled.
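
      For example, at the default of 10,000 logged latencies per distribution, that is 4 distributions * 10,000 latencies * 8 bytes = 320,000 bytes, i.e., about 320 KB on the node with tracing enabled.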

      Caveats

      The predictions are conservative, or worst-case, meaning we may predict more staleness than actually occurs in practice, for the following reasons:

      • We do not account for read repair.
      • We do not account for Merkle tree exchange.
      • Multi-version staleness is particularly conservative.

      The predictions are optimistic in the following ways:

      • We do not predict the impact of node failure.
      • We do not model hinted handoff.

      We simulate non-local reads and writes. We assume that the coordinating Cassandra node is not itself a replica for a given key. (See discussion below.)

      Predictions are only as good as the collected latencies. Generally, the more latencies that are collected, the better, but if the environment or workload changes, previously collected latencies may no longer reflect current behavior. Also, we currently don't distinguish between column families or value sizes. This is doable, but it adds complexity to the interface and possibly more storage overhead.

      Finally, for accurate results, we require replicas to have synchronized clocks (Cassandra requires this from clients anyway). If clocks are skewed/out of sync, this will bias predictions by the magnitude of the skew.

      We can potentially improve these if there's interest, but this is an area of active research.


      Peter Bailis and Shivaram Venkataraman
      pbailis@cs.berkeley.edu
      shivaram@cs.berkeley.edu

      Attachments

        1. 4261-v4.txt (44 kB, Jonathan Ellis)
        2. 4261-v5.txt (45 kB, Jonathan Ellis)
        3. 4261-v6.txt (44 kB, Shivaram Venkataraman)
        4. demo-pbs-v3.sh (1 kB, Shivaram Venkataraman)
        5. pbs-nodetool-v3.patch (50 kB, Shivaram Venkataraman)


          People

            Assignee: Peter Bailis
            Reporter: Peter Bailis
            Authors: Peter Bailis
            Reviewers: Jonathan Ellis
            Votes: 0
            Watchers: 7
