Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Currently, repair works by periodically running "bulk" jobs that (1)
      perform a validating compaction, building up an in-memory merkle tree,
      and (2) stream ring segments as needed according to differences
      indicated by the merkle tree.

      There are some disadvantages to this approach:

      • There is a trade-off between memory usage and the precision of the
        merkle tree. Less precision means more data streamed relative to
        what is strictly required.
      • Repair is a periodic "bulk" process that runs for a significant
        period and, although possibly rate limited like compaction (on 0.8,
        or with the backported throttling patch applied), diverges in its
        performance characteristics from "normal" operation of the cluster.
      • The impact of imprecision can be huge on a workload dominated by I/O
        and where cache locality is critical, since you will suddenly
        transfer lots of data to the target node.

      I propose a process whereby anti-entropy is incremental and continuous
      over time. In order to avoid being seek-bound one still wants to do
      work in some form of bursty fashion, but the amount of data processed
      at a time could be sufficiently small that the impact on the cluster
      feels a lot more continuous, and that the page cache allows us to
      avoid re-reading differing data twice.

      Consider a process whereby a node is constantly performing a repair
      operation for each CF. The current state of the repair process is
      defined by:

      • A starting timestamp of the current iteration through the token
        range the node is responsible for.
      • A "finger" indicating the current position along the token ring to
        which iteration has completed.

      This information, besides being kept in memory, could periodically
      (every few minutes or so) be persisted to disk.
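
      As a very rough sketch (the class and field names here are
      hypothetical, not existing Cassandra code), the per-CF state is tiny:

          // Hypothetical per-CF repair cursor; kept in memory and persisted every few minutes.
          import java.math.BigInteger;

          public class RepairCursor {
              long iterationStartedAtMillis; // when the current pass over this node's range began
              BigInteger finger;             // token up to which the current pass has completed
          }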

      The finger advances by the node selecting the next small "bit" of the
      ring, doing whatever merkling/hashing/checksumming is necessary on
      that small part, asking neighbors to do the same, and arranging for
      neighbors to send the node data for mismatching ranges. The data would
      be sent either by way of mutations, as with read repair, or by
      streaming sstables. Either way it would be small amounts of data that
      act roughly the same as regular writes from the perspective of
      compaction.

      Some nice properties of this approach:

      • It's "always on"; no periodic sudden effects on cluster performance.
      • Restarting nodes never cancels or breaks anti-entropy.
      • Huge compactions of entire CFs never clog up the compaction queue
        (still potentially an issue even with concurrent compactions in
        0.8).
      • Because we're always operating on small chunks, there is never the
        same kind of trade-off for memory use. A merkle tree or similar
        could potentially be calculated at a very detailed level. Although
        the precision would likely not matter much from the perspective of
        reading from disk if we are in the page cache anyway, very high
        precision could be very useful when doing anti-entropy across data
        centers on slow links.

      There are devils in the details, like how to select an appropriate
      ring segment given that you don't have knowledge of the data density
      on other nodes. But I feel that the overall idea/process seems very
      promising.

        Issue Links

          Activity

          Peter Schuller added a comment -

           I should also add that this type of approach completely eliminates the need for operators to initiate and monitor repair, which seems to cause a lot of confusion. What is still needed is monitoring that Cassandra isn't saying "something is wrong, I seem to be running behind with repairs". But actually exposing such a binary condition should be trivial if the rest is implemented.

           The alerting would be even less critical if the information kept as part of an implementation of this were also used when doing compaction; one could automatically prevent tombstones from ever being removed prematurely based on whether or not AES has happened recently enough.

           (All of this assumes we can accurately and reliably determine that repair actually succeeded, so bugs causing silent failure of repair need fixing, and it needs to not easily break again.)

          Jonathan Ellis added a comment - edited

          This seems like a good place to summarize a discussion from IRC:

           Terje had an interesting idea: instead of repair operating on ranges (with merkle trees), maybe the right unit of repair is an sstable: i.e. when you repair an sstable, it does CL.ALL reads of each row, then adds "repaired" metadata. Compared to merkle-tree repair, you are doing random i/o, in exchange for being able to only worry about new-since-last-repair data (plus whatever we have to merge that the BF doesn't skip).
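
           (A very rough sketch of that flow, with entirely hypothetical names, just to make the shape concrete:)

               // Per-sstable repair via CL.ALL reads; every name here is hypothetical.
               import java.util.List;

               public abstract class SSTableRepairTask {
                   abstract List<String> rowKeysIn(String sstable);    // walk the sstable's index
                   abstract void readRowAtAll(String rowKey);          // CL.ALL read: repairs divergent replicas
                   abstract void markRepaired(String sstable);         // record "repaired" metadata

                   void repair(String sstable) {
                       for (String key : rowKeysIn(sstable))
                           readRowAtAll(key);    // random i/o, but only over not-yet-repaired data
                       markRepaired(sstable);    // so this sstable is skipped on the next pass
                   }
               }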

          [This feels like a better fit for Cassandra to me (Jonathan), because it takes advantage of SSTable immutability. Constantly re-repairing data that doesn't need it (in large increments or small) is wasteful.]

           Stu added: one way to incorporate Terje's idea would be to continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair.

           Peter commented: The only significant downside I can think of off the top of my head is that when nodes are unavailable or otherwise have some problem for a long time, the sstables created since last repair might be a non-trivial amount of data. So it doesn't [in the worst case] fulfil the goal of "guaranteeing" that the "bulkiness" of an individual repair does not exceed some reasonable value. But it still sounds a lot better than the current situation.

          Jeremiah Jordan added a comment -

           I like the idea of not repairing stuff that has already been repaired. One problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node that has the key is repaired.

          Peter Schuller added a comment -

          Good point. I think that is significant because it means that the question "Is the range for which I am the primary replica repaired?" becomes a more difficult question to answer. Unless range scans are used at CL.ALL. Not sure which was intended.

          Chris Burroughs added a comment -

          If we don't repair stuff that has already been marked as repaired, doesn't that prevent repair from recovering from bitrot or bugs in the (previous) repair process?

          Peter Schuller added a comment -

          I would argue that repair's ability to do that is limited as it is, and the primary concern is making anti-entropy a less invasive and more reliable process. Isn't 'scrub' closer to what you'd want for bitrot btw?

           However, longer-term with checksumming and the ability to truly handle arbitrary corruption, I think it would be a great feature to integrate the detection of checksum mismatches with automatic repair from neighboring nodes (this time the word 'repair' truly being repair, not anti-entropy - well, unless you count bitrot as entropy, which I suppose one should).

          Jonathan Ellis added a comment - edited

          one way to incorporate terje's idea would be to continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair

           I'm not sure this works, because as compaction does its work you will have a lot of sstable turnover. Even if you preserve repaired state where possible (two repaired sstables merge to form an sstable also marked repaired), new sstables will "fragment" out as they're compacted under leveled compaction, "contaminating" what they are merged with.

          (Note that the problem gets worse if a peer is down, because then (as Peter notes) un-repaired sstables start to pile up. Maybe we can fall back to merkle trees if more than N% of sstables are un-repaired.)

          You could create separate sets of levels for "repaired" and "unrepaired" sstables, I suppose. That feels ugly.

          You could also keep a "repaired" bloom filter at the row level for partially-repaired sstables. That feels more reasonable to me. (But that brings us back to doing repair-by-CL.ALL reads rather than trees of ranges.)
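
           (Roughly, and only as an illustration with hypothetical names, such a per-sstable filter could be consulted before re-repairing a row; Guava's BloomFilter is used here just for the sketch. Note a false positive would treat an occasional unrepaired row as repaired.)

               // Hypothetical per-sstable "repaired rows" filter, sketched with Guava's BloomFilter.
               import com.google.common.hash.BloomFilter;
               import com.google.common.hash.Funnels;
               import java.nio.charset.StandardCharsets;

               public class RepairedRowsFilter {
                   private final BloomFilter<CharSequence> repaired =
                       BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000);

                   void markRepaired(String rowKey)        { repaired.put(rowKey); }
                   boolean probablyRepaired(String rowKey) { return repaired.mightContain(rowKey); }
               }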

          one problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node is repaired that has the key

          I don't think this is a blocker – it just means you still have to run repair against each node, which has always been the case.

          Peter Schuller added a comment -

          CASSANDRA-2699 has a baby step towards this which addresses incremental repair and the merkle tree resolution problem (but does not remove, in fact increases, the need for external scripting).

          Jonathan Ellis added a comment -

          I assume you meant to link a different issue?

          Peter Schuller added a comment -

          Sorry yes - CASSANDRA-3912.

          Jonathan Ellis added a comment -

          A big problem with the only-repair-new-data idea is that it doesn't deal with hardware-level data loss – i.e., we want to re-repair data that was complete, until we lost a disk or a machine.

          Benjamin Coverston added a comment - edited

          From CASSANDRA-4482

          Instead, we maintain a Merkle Tree (MT) in memory and update it with every single column insert in ColumnFamilyStore.apply(). We use column.updateDigest(digest) on all the changes in order to create a hash per column update and then XOR this hash with the existing one in the Merkle Tree bucket for the corresponding row.
          This Merkle Tree is created with the column family (one per range), initialized with zeros, and persisted to disk with regular snapshots.

          The commutative properties of XOR make it possible to update the MT incrementally without having to read on write.

          I'm pretty sure this means that we should be able to XOR the buckets together from pre-computed merkle tree SSTable components.

           We could just create these on flush and merge them on compaction; validation compaction is then just a read and merge of the MT components.
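
           (As a sketch of the merge step only, with hypothetical names: if each sstable carried a pre-computed array of bucket hashes, merging would just be XOR-ing the buckets together.)

               import java.util.List;

               public class TreeComponentMerger {
                   // XOR together the bucket hashes of each sstable's pre-computed tree component.
                   static byte[][] merge(List<byte[][]> perSSTableBuckets, int buckets, int hashLen) {
                       byte[][] merged = new byte[buckets][hashLen]; // starts zeroed, like a freshly created tree
                       for (byte[][] component : perSSTableBuckets)
                           for (int b = 0; b < buckets; b++)
                               for (int i = 0; i < hashLen; i++)
                                   merged[b][i] ^= component[b][i];
                       return merged;
                   }
               }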

          Sylvain Lebresne added a comment -

          I'm pretty sure this means that we should be able to XOR the buckets together from pre-computed merkle tree SSTable components

           No, I don't think that works. Because nodes are certainly not guaranteed to be at the same state of compaction, even if they have the same data. Meaning that for a given column whose value A has been overwritten to B at some point, one of the nodes may have 1 sstable with just B while another node may have 2 sstables, one with A and one with B, because it hasn't compacted things yet. But hash(B) (on the first node) will not be equal to hash(A) xor hash(B) (on the second node), even though both nodes really have the same data (since B overwrites A upon column merge on the 2nd node).
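
           (A tiny self-contained illustration of that mismatch, hypothetical and with an arbitrary hash function:)

               import java.util.zip.CRC32;

               public class XorCounterexample {
                   static long h(String value) { CRC32 c = new CRC32(); c.update(value.getBytes()); return c.getValue(); }

                   public static void main(String[] args) {
                       long node1 = h("B");          // already compacted: one sstable containing only B
                       long node2 = h("A") ^ h("B"); // not yet compacted: XOR of the two sstables' bucket hashes
                       System.out.println(node1 == node2); // false, although both nodes hold the same live data (B)
                   }
               }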

          Benjamin Coverston added a comment -

          You're right, which also means that in the face of idempotent writes and replay the incremental scenario is also broken with the in-memory tree.

          Sylvain Lebresne added a comment -

          in the face of idempotent writes and replay the incremental scenario is also broken with the in-memory tree

           Yeah, and streaming and read repair break things too, I think (for the in-memory tree idea), and I'm not sure how you even compute the initial in-memory tree at startup in the first place (there's talk of saving the in-memory tree to disk and reloading it on restart, but I see many problems with that, so it could be that I misunderstood the idea). But overall it sounded like the conclusion of CASSANDRA-4482 was that "the in-memory trees themselves do not sound very useful", which from my current comprehension of the idea (which could be very partial/wrong) sounds about right.

          Jonathan Ellis added a comment -

          continue to build merkle trees, but to only build them using sstables that have been created since the last time we ran repair

          After thinking about this for a while, I think this is our best option, despite the drawbacks I outlined. I've opened CASSANDRA-5351 to start fresh without the historical noise here.


            People

             • Assignee: Unassigned
             • Reporter: Peter Schuller
             • Votes: 5
             • Watchers: 25
