Cassandra
CASSANDRA-2610

Have the repair of a range repair *all* the replicas for that range

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: Core
    • Labels: None

      Description

      Say you have a range R whose replicas are A, B and C. If you run repair on node A for that range R, when the repair ends you only know that A is fully repaired; B and C are not. That is, B and C are up to date with what A contained before the repair, but they are not up to date with one another.

      This makes it a pain to schedule "optimal" cluster repairs, that is, repairing a full cluster without doing work twice: you would still have to run a repair on B or C, which makes A, B and C redo a validation compaction on R, and with more replicas it is even more annoying.

      However, it is fairly easy, during the first repair on A, to have A compare all the merkle trees, i.e. the ones for B and C as well, and ask B and C to stream whatever differences they have between themselves.
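
      A minimal sketch of that idea (illustrative only; helper names such as waitForTrees, differences and requestStreaming are hypothetical, not the actual AntiEntropyService API): once the coordinating node holds a merkle tree from every replica of the range, it compares every pair of trees and asks each differing pair to stream between themselves, instead of only the pairs that involve A.

          // Illustrative sketch, not the real repair code; assumes java.util.* and java.net.InetAddress.
          List<InetAddress> replicas = Arrays.asList(a, b, c);
          Map<InetAddress, MerkleTree> trees = waitForTrees(replicas); // one tree per replica, A included

          for (int i = 0; i < replicas.size(); i++)
          {
              for (int j = i + 1; j < replicas.size(); j++)
              {
                  InetAddress left = replicas.get(i);
                  InetAddress right = replicas.get(j);
                  // ranges on which the two trees disagree
                  List<Range> diffs = differences(trees.get(left), trees.get(right));
                  if (!diffs.isEmpty())
                      requestStreaming(left, right, diffs); // stream directly between the two replicas
              }
          }

      With 3 replicas this yields the pairs (A,B), (A,C) and (B,C), so a single repair run leaves all three replicas consistent with one another.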

      Attachments

      1. 0003-cleanup-and-fix-private-reference.patch
        11 kB
        Jonathan Ellis
      2. 0002-Cleanup-log-messages-v2.patch
        8 kB
        Sylvain Lebresne
      3. 0001-Make-repair-repair-all-hosts-v2.patch
        33 kB
        Sylvain Lebresne
      4. 0001-Make-repair-repair-all-hosts.patch
        51 kB
        Sylvain Lebresne


          Activity

          Jonathan Ellis added a comment -

          I thought it already works like this (because when you repair A, you'll often see additional data stream to B, C). So +1 for making it conform to my mental picture.

          Sylvain Lebresne added a comment -

          Patch against 0.8.1. It applies on top of CASSANDRA-2433 because it changes enough common code that I don't want to deal with rebasing back and forth (and it actually reuses some of the refactoring from CASSANDRA-2433 anyway).

          Sylvain Lebresne added a comment -

          v2 attached (based on trunk).

          It includes a second patch that simply cleans up the messages logged by repair (so it is easier to follow what is going on). I could have done this in a separate ticket, but rebasing repair tickets against each other is getting old.

          Note that I updated the MessagingService to list version 1.0, as imho this is much cleaner.

          Jonathan Ellis added a comment -

          03 fixes a reference to the private UUIDGen.instance field, and cleans up RepairJob to (1) use concurrent structures instead of synchronized and (2) track Differencer objects directly instead of using Pair<InetAddress, InetAddress> as proxies. Also adds some comments.

          otherwise, lgtm.

          Sylvain Lebresne added a comment -

          use concurrent structures instead of synchronized

          I think this doesn't work, or needs modifications elsewhere. The synchronized was making sure that for a given job there would be one and only one call to addTree that returns 0 (otherwise we would generate the differencers twice). The synchronized was there to make the 'remove then return length' atomic, more than to protect access to the structure. completedSynchronization has the same problem, though in that case the consequence is benign: we would just print the sync message twice (still not as good as with synchronized).
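
          A minimal sketch of the invariant being described (illustrative field and method names, not the actual RepairJob code): the "remove the endpoint, then return how many trees are still missing" step has to be atomic so that exactly one caller sees the count reach zero and generates the differencers.

              // Sketch only; names are hypothetical.
              private final Set<InetAddress> remaining = new HashSet<InetAddress>(replicas); // replicas we still expect a tree from
              private final Map<InetAddress, MerkleTree> trees = new HashMap<InetAddress, MerkleTree>();

              // synchronized so that "remove, then return size" is atomic: without it, two threads
              // could each remove their endpoint, then both observe size() == 0 and both generate
              // the differencers.
              public synchronized int addTree(InetAddress endpoint, MerkleTree tree)
              {
                  trees.put(endpoint, tree);
                  remaining.remove(endpoint);
                  return remaining.size();
              }

              // caller: only the single call that returns 0 triggers differencer creation
              if (job.addTree(endpoint, tree) == 0)
                  createDifferencers(job);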

          Also, without having run it, I'm not sure the change to the test works: since REMOTE will never send a merkle tree, the differencers will never be generated. But to be honest, this test is really not very useful, and I would be fine with just removing it instead of faking things to the point where it doesn't really test anything.

          Jonathan Ellis added a comment -

          new 03 adds back synchronization (with comments) and removes the "manual repair test."

          Sylvain Lebresne added a comment -

          That last patch lgtm, I'll go ahead and commit.

          Sylvain Lebresne added a comment -

          Committed, thanks

          Hudson added a comment -

          Integrated in Cassandra #1067 (See https://builds.apache.org/job/Cassandra/1067/)
          Make repair of a range sync all replica pairs for this range
          patch by slebresne; reviewed by jbellis for CASSANDRA-2610

          slebresne : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1164463
          Files :

          • /cassandra/trunk/CHANGES.txt
          • /cassandra/trunk/src/java/org/apache/cassandra/net/MessagingService.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/AntiEntropyService.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageService.java
          • /cassandra/trunk/src/java/org/apache/cassandra/streaming/StreamingRepairTask.java
          • /cassandra/trunk/src/java/org/apache/cassandra/utils/UUIDGen.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/io/CompactSerializerTest.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/service/AntiEntropyServiceTestAbstract.java
          Omid Aladini added a comment -

          This indeed makes repair across a cluster easier to manage, especially together with -pr (CASSANDRA-2606), but the downside is that all replicas for a range are affected once the data is streamed. In my case repair transfers a huge amount of data each time (possibly due to Merkle tree precision, CASSANDRA-2698), causing hundreds of pending compactions that affect reads and counter-writes for the affected range. I'd prefer to have Cassandra calculate Merkle trees multiple times (which can be throttled) and have faster quorum reads when only one replica is slowed down. Given that incremental repair (CASSANDRA-2699) is still in progress, do you think it makes sense to make repair-on-all-replicas optional? Possibly via a flag on the node that the repair is run on?

          Sylvain Lebresne added a comment -

          I think that you may want to look at -snapshot (CASSANDRA-3721). It sequentializes both the merkle tree computation and the streaming. In other words, if you have 3 replicas A, B and C for the range, A will compute its merkle tree, then B, then C, and the same for the streaming phase. This exists exactly to avoid the "all replicas for the range are slow because they are all doing repair work" problem.
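
          Roughly, the ordering being described looks like the sketch below (illustrative names, not the actual repair code): each validation is requested and awaited before the next one, and the resulting streaming sessions are likewise run one after the other.

              // Illustrative sketch of sequential ("snapshot") repair ordering; names are hypothetical.
              Map<InetAddress, MerkleTree> trees = new LinkedHashMap<InetAddress, MerkleTree>();
              for (InetAddress replica : replicas)                        // A, then B, then C
                  trees.put(replica, requestValidation(replica, range));  // wait for each tree before asking the next replica

              for (Differencer differencer : createDifferencers(trees))
                  differencer.run();                                      // one streaming session at a time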

          Omid Aladini added a comment -

          That's great, I didn't know about that. This means that by throttling the sequentialized streaming, I might be able to let pending compactions resolve on the previously repaired replica, although tuning this would be a challenge (possibly by dynamically changing streamthroughput according to pending compactions, which doesn't seem ideal).
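
          One way to sketch that kind of dynamic throttling is a small sidecar that polls pending compactions over JMX and lowers the stream throughput while the backlog is high. This is an assumption-laden illustration: the MBean and attribute names below (CompactionManager/PendingTasks, StorageService/StreamThroughputMbPerSec) are from memory and may differ by Cassandra version, and the thresholds are arbitrary.

              import javax.management.*;
              import javax.management.remote.*;

              // Hedged sketch of "adjust streamthroughput from pending compactions"; not part of Cassandra.
              public class RepairThrottle
              {
                  public static void main(String[] args) throws Exception
                  {
                      JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
                      MBeanServerConnection mbs = JMXConnectorFactory.connect(url).getMBeanServerConnection();

                      ObjectName compaction = new ObjectName("org.apache.cassandra.db:type=CompactionManager");
                      ObjectName storage = new ObjectName("org.apache.cassandra.db:type=StorageService");

                      while (true)
                      {
                          int pending = ((Number) mbs.getAttribute(compaction, "PendingTasks")).intValue();
                          // crude policy: throttle streaming hard while compactions pile up
                          int throughputMbPerSec = pending > 50 ? 10 : 400;
                          mbs.setAttribute(storage, new Attribute("StreamThroughputMbPerSec", throughputMbPerSec));
                          Thread.sleep(60000); // re-evaluate once a minute
                      }
                  }
              }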


            People

            • Assignee: Sylvain Lebresne
            • Reporter: Sylvain Lebresne
            • Reviewer: Jonathan Ellis
            • Votes: 0
            • Watchers: 8


                Time Tracking

                • Original Estimate: 8h
                • Remaining Estimate: 8h
                • Time Spent: Not Specified
