• Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: None
    • Labels:


      Currently, repair works by periodically running "bulk" jobs that (1)
      perform a validating compaction, building up an in-memory merkle tree,
      and (2) stream ring segments as needed according to the differences
      indicated by the merkle tree.

      There are some disadvantages to this approach:

      • There is a trade-off between memory usage and the precision of the
        merkle tree. Less precision means more data streamed relative to
        what is strictly required.
      • Repair is a periodic "bulk" process that runs for a significant
        period and, although possibly rate limited in the same way as
        compaction (in 0.8, or with the throttling patch backported), is a
        divergence in terms of performance characteristics from "normal"
        operation of the cluster.
      • The impact of imprecision can be huge on a workload dominated by I/O
        and with cache locality being critical, since you will suddenly
        transfer a lot of data to the target node.
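
      To make the precision trade-off concrete, here is a rough sketch of
      how many rows end up streamed when every mismatching leaf has to be
      sent whole. The function name and the assumption that rows are spread
      evenly across leaves are illustrative only; the real merkle tree
      splits on token ranges, not row indexes.

```python
def rows_streamed(total_rows: int, leaves: int, differing_rows: list[int]) -> int:
    """Rows sent if each leaf containing a differing row is streamed whole.

    Hypothetical model: rows are numbered 0..total_rows-1 and divided
    evenly between the leaves of the tree.
    """
    rows_per_leaf = -(-total_rows // leaves)  # ceiling division
    dirty_leaves = {row // rows_per_leaf for row in differing_rows}
    # The last leaf may cover fewer rows than the others.
    return sum(
        min(rows_per_leaf, total_rows - leaf * rows_per_leaf)
        for leaf in dirty_leaves
    )


# Two differing rows out of 1000, at two tree sizes.
coarse = rows_streamed(1000, 10, [5, 995])
fine = rows_streamed(1000, 100, [5, 995])
```

      In this model, ten times as many leaves means ten times less data
      streamed for the same two differing rows, at the cost of holding ten
      times as many hashes in memory.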

      I propose a process whereby anti-entropy is incremental and
      continuous over time. In order to avoid being seek-bound one still
      wants to do work in some form of bursty fashion, but the amount of
      data processed at a time could be sufficiently small that the impact
      on the cluster feels a lot more continuous, and that the page cache
      lets us avoid reading differing data from disk twice.

      Consider a process whereby a node is constantly performing a per-CF
      repair operation for each CF. The current state of the repair process
      is defined by:

      • A starting timestamp of the current iteration through the token
        range the node is responsible for.
      • A "finger" indicating the current position along the token ring to
        which iteration has completed.

      This information, besides being held in memory, could periodically
      (every few minutes or so) be stored persistently on disk.
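
      A minimal sketch of that state and its periodic persistence might
      look as follows. The names RepairState, save_state and load_state,
      the JSON file format and the integer token are all assumptions made
      for illustration; nothing here reflects existing Cassandra code.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class RepairState:
    iteration_started_at: float  # when the current pass over the range began
    finger: int                  # token up to which the current pass is done


def save_state(state: RepairState, path: str) -> None:
    """Write the state to a temp file, then rename it over the old copy,
    so a crash mid-write never leaves a half-written state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(state), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_state(path: str, now: float) -> RepairState:
    """Resume from disk, or start a fresh pass if nothing was saved."""
    try:
        with open(path) as f:
            return RepairState(**json.load(f))
    except FileNotFoundError:
        return RepairState(iteration_started_at=now, finger=0)
```

      Since the state is only saved every few minutes, a restarted node
      would redo at most a few minutes of work, which is harmless because
      repairing a segment twice is idempotent.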

      The finger advances by the node selecting the next small "bit" of the
      ring and doing whatever merkling/hashing/checksumming is necessary on
      that small part, and then asking neighbors to do the same, and
      arranging for neighbors to send the node data for mismatching
      ranges. The data would be sent either by way of mutations like with
      read repair, or by streaming sstables. But it would be small amounts
      of data that will act roughly the same as regular writes from the
      perspective of compaction.
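
      The step described above could be sketched roughly like this. Nodes
      are modelled as plain dicts mapping an integer token to a
      (timestamp, value) pair, neighbors are queried in-process rather than
      over the network, and a single SHA-256 hash stands in for whatever
      merkling/hashing is chosen; segment_hash and advance_finger are
      hypothetical names.

```python
import hashlib

Node = dict[int, tuple[int, str]]  # token -> (write timestamp, value)


def segment_hash(data: Node, start: int, end: int) -> str:
    """Hash every entry whose token falls in [start, end), in token order."""
    h = hashlib.sha256()
    for token in sorted(t for t in data if start <= t < end):
        h.update(f"{token}={data[token]};".encode())
    return h.hexdigest()


def advance_finger(local: Node, neighbors: list[Node],
                   finger: int, step: int, ring_size: int) -> int:
    """Repair the segment [finger, finger + step) and return the new finger.

    The finger wraps to 0 once the end of the range is reached, which is
    where a new iteration (and a new starting timestamp) would begin.
    """
    end = min(finger + step, ring_size)
    for neighbor in neighbors:
        if segment_hash(neighbor, finger, end) == segment_hash(local, finger, end):
            continue  # this neighbor agrees on the segment; nothing to send
        # Mismatch: the neighbor sends its data for the segment, which is
        # applied like ordinary writes (newest timestamp wins).
        for token, (ts, value) in neighbor.items():
            if finger <= token < end:
                if token not in local or ts > local[token][0]:
                    local[token] = (ts, value)
    return end if end < ring_size else 0
```

      For example, with a ring of size 200 and a step of 100, two calls
      cover the whole range and bring the finger back to 0:

```python
local = {10: (1, "a"), 150: (1, "x")}
neighbor = {10: (2, "b"), 20: (1, "c"), 150: (5, "y")}
finger = advance_finger(local, [neighbor], 0, 100, 200)
```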

      Some nice properties of this approach:

      • It's "always on"; no periodic sudden effects on cluster performance.
      • Restarting nodes never cancels or breaks anti-entropy.
      • Huge compactions of entire CFs never clog up the compaction queue
        (not necessarily a non-issue even with concurrent compactions in
      • Because we're always operating on small chunks, there is never the
        same kind of trade-off for memory use. A merkle tree or similar
        could potentially be calculated at a very detailed level. Although
        the precision from the perspective of reading from disk would likely
        not matter much if we are in page cache anyway, very high precision
        could be very useful when doing anti-entropy across data centers
        on slow links.

      There are devils in details, like how to select an appropriate ring
      segment given that you don't have knowledge of the data density on
      other nodes. But I feel that the overall idea/process seems very





              • Assignee:
                scode Peter Schuller
              • Votes:
                5
              • Watchers:
                26


                • Created: