CASSANDRA-4905: Repair should exclude gcable tombstones from merkle-tree computation

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 3
    • Component/s: Core
    • Labels: None

      Description

      Currently, gcable tombstones get repaired whenever some replicas have already compacted them away while others have not.

      This could be avoided by ignoring all gcable tombstones during merkle tree calculation.

      This was discussed with Sylvain on the mailing list:
      http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/repair-compaction-and-tombstone-rows-td7583481.html
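
      The gist of the change can be sketched as follows. This is a minimal, hypothetical Java sketch of the idea only, not the code in 4905.txt; the names RowHasher, Cell and isGcableTombstone are made up for illustration. A tombstone whose gc_grace period has already elapsed is simply skipped while feeding the row into the digest, so a replica that has compacted it away and a replica that still carries it produce the same Merkle tree hash.

      // Minimal, hypothetical sketch: skip gcable tombstones when hashing a row for the
      // Merkle tree. These names do not correspond to the classes touched by 4905.txt.
      import java.security.MessageDigest;
      import java.security.NoSuchAlgorithmException;

      class RowHasher
      {
          static class Cell
          {
              final byte[] name;
              final byte[] value;            // empty (not null) for tombstones
              final boolean isTombstone;
              final int localDeletionTime;   // seconds; set when the cell was deleted

              Cell(byte[] name, byte[] value, boolean isTombstone, int localDeletionTime)
              {
                  this.name = name;
                  this.value = value;
                  this.isTombstone = isTombstone;
                  this.localDeletionTime = localDeletionTime;
              }
          }

          private final int gcGraceSeconds;

          RowHasher(int gcGraceSeconds)
          {
              this.gcGraceSeconds = gcGraceSeconds;
          }

          // True if the cell is a tombstone whose gc_grace period has already elapsed.
          boolean isGcableTombstone(Cell c, int nowInSeconds)
          {
              return c.isTombstone && c.localDeletionTime + gcGraceSeconds <= nowInSeconds;
          }

          // Digest a row for the Merkle tree, ignoring tombstones that are already gcable,
          // so compacted and not-yet-compacted replicas hash the same bytes.
          byte[] digestRow(Iterable<Cell> row, int nowInSeconds) throws NoSuchAlgorithmException
          {
              MessageDigest digest = MessageDigest.getInstance("MD5");
              for (Cell c : row)
              {
                  if (isGcableTombstone(c, nowInSeconds))
                      continue;
                  digest.update(c.name);
                  digest.update(c.value);
              }
              return digest.digest();
          }
      }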

      Attachments
      • 4905.txt (8 kB, Sylvain Lebresne)

          Activity

          Christian Spriegel added a comment (edited)

          Also, I wonder if an expired column could create the digest of a tombstone once it is timed out.

          If a gcable tombstone does not alter the digest, then a timed-out ExpiredColumn should behave like a tombstone.

          Edit: It already works like this

          Sylvain Lebresne added a comment -

          I wonder if an expired column could create the digest of a tombstone once it is timed out

          This is already pretty much the case in practice. Expired columns are transformed into tombstones at deserialization time, so repair will in fact get a tombstone for any expired column (unless the column expired between deserialization and its use to compute the hash, but that doesn't really matter). But truth be told, as far as repair is concerned, it would be better to never transform/consider an expired column as a tombstone: one node could see an expired column just before expiration while another could see it just after, and the fact that we do change them into tombstones means that in that case repair will consider them inconsistent.
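
          A toy illustration of the timing window described above (hypothetical names, not Cassandra's actual column classes): the digest of a TTLed cell depends on the local clock at validation time, so two replicas hashing the same cell on either side of its expiration second will disagree, and repair will flag the row as inconsistent.

          // Hypothetical sketch of the race: the digest of a TTLed cell depends on the local
          // clock at validation time, so two replicas hashing it a second apart around its
          // expiration can produce different digests for identical data.
          import java.security.MessageDigest;
          import java.security.NoSuchAlgorithmException;

          class ExpiringCellDigest
          {
              static byte[] digestCell(byte[] name, byte[] value, int expiresAtInSeconds, int nowInSeconds)
                      throws NoSuchAlgorithmException
              {
                  MessageDigest digest = MessageDigest.getInstance("MD5");
                  digest.update(name);
                  if (nowInSeconds >= expiresAtInSeconds)
                      digest.update((byte) 0);   // already expired: hashed like a tombstone marker
                  else
                      digest.update(value);      // still live: the value itself is hashed
                  return digest.digest();
              }
          }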

          Sylvain Lebresne added a comment -

          In other words, for existing releases we should probably just do what the title here suggests. But in the long run (because it requires adding a new parameter to the network protocol, so at best it can be done for 1.2), we should probably consider having repair agree on a starting timestamp and use that as the reference both to expire columns and to decide whether a tombstone is gcable or not.
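
          A rough, hypothetical sketch of that longer-term idea (the class and method names below are illustrative, not part of any actual protocol change): the coordinator picks one reference time for the whole repair session and every replica uses it, instead of its own clock, both to expire TTLed columns and to decide whether a tombstone is gcable.

          // Hypothetical sketch of the 1.2+ idea: one reference time is agreed on per repair
          // session and shipped to every replica, which uses it instead of its own clock.
          class ValidationRequest
          {
              final int referenceTimeInSeconds;   // chosen by the coordinator once, up front

              ValidationRequest(int referenceTimeInSeconds)
              {
                  this.referenceTimeInSeconds = referenceTimeInSeconds;
              }

              boolean isLive(int expiresAtInSeconds)
              {
                  return expiresAtInSeconds > referenceTimeInSeconds;
              }

              boolean isGcableTombstone(int localDeletionTime, int gcGraceSeconds)
              {
                  return localDeletionTime + gcGraceSeconds <= referenceTimeInSeconds;
              }
          }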

          Jonathan Ellis added a comment -

          Tagging version based on the basic part in the title. Let's open a new ticket for 1.3 if we want to get crazy with protocol changes.

          Sylvain Lebresne added a comment -

          Patch attached. This does the basic part only; I've opened CASSANDRA-4932 for a slightly better follow-up (which honestly won't improve this by a lot).

          Jonathan Ellis added a comment -

          +1 for 1.2. I would rather leave 1.1 alone, but I can be convinced otherwise.

          Christian Spriegel added a comment -

          I'm fine with that.

          Sylvain Lebresne added a comment -

          Alright then, I'm good too. Rebased and committed to 1.2 only, thanks. The patch is here anyway for those who really, really want it in 1.1.

          Michael Theroux added a comment (edited)

          I believe we are hitting a situation where this bug is being problematic in 1.1.9. We have a column family that, for historical reasons, we run staggered major compactions on. This column family also has many deletes. We've noticed our bloom filters steadily increasing in size over time. Bloom filters on a specific node would go down a great deal after a major compaction, only to increase back to near their original level over a few days.

          What I believe is happening is that we had a staggered repair schedule along with a staggered major compaction schedule. The major compaction would remove the tombstones, but the repair would stream them back.

          To test the theory, I adjusted the major compaction schedule to perform a major compaction across all nodes on the same day. This week's behavior and bloom filter growth has been much better.

          Is there a reason why this patch was not applied to 1.1.X? Are there stability concerns? We aren't ready to make the jump to 1.2, and would prefer not to move this table to Leveled Compaction if we don't have to.

          Christian Spriegel added a comment -

          Yeah, repair with TTLed columns can be nasty. Since November, I've seen repairs streaming up to 90GB of data for a single repair. According to nodetool, this cluster had no dropped writes. So I would assume it was consistent already.

          Before, Sun Dec 23 08:00:01 UTC 2012:
          192.168.1.1 datacenter1 rack1 Up Normal 404.17 GB 33.33% 0
          192.168.1.2 datacenter1 rack1 Up Normal 410.9 GB 33.33% 56713727820156410577229101238628035242
          192.168.1.3 datacenter1 rack1 Up Normal 404.27 GB 33.33% 113427455640312821154458202477256070484

          After, Sun Dec 23 12:19:38 UTC 2012:
          192.168.1.1 datacenter1 rack1 Up Normal 497.95 GB 33.33% 0
          192.168.1.2 datacenter1 rack1 Up Normal 413.26 GB 33.33% 56713727820156410577229101238628035242
          192.168.1.3 datacenter1 rack1 Up Normal 449.83 GB 33.33% 113427455640312821154458202477256070484

          I'm not saying I want this patch in 1.1. I just wanted to share this rather spectacular repair.

          Robert Coli added a comment -

          We have a 1.1-era cluster with TTLs where repair turns ~500 GB of actual data into 1.5 TB. Would love this merged into 1.1.

          Jonathan Ellis added a comment -

          1.1 is over a year old; we're really shooting for stability over new functionality there now.

          The good news is, by now 1.2.x should be about as stable as 1.1-with-everything-people-want-backported would be.

          Michael Theroux added a comment -

          Can anyone comment on the risk of a user (such as myself) backporting this fix and patching locally? The code that was changed in the patch looks identical in 1.1.11.

          We have a situation where a column family with lots of deletes is running under leveled compaction. The validation doesn't take too long, but afterwards we get 2k compaction tasks that take several hours to run, when really there shouldn't be any inconsistency. What I suspect is happening is that, as tombstones pass gc_grace, they are compacted away on some nodes but not others by the time repair is run. I suspect the majority of the 2k compactions are gc_graced tombstones being brought back in sync.

          I'm setting up a test environment with baseline data; I'm going to reproduce the repair, reset to baseline, and re-run the repair with this patch to see if this is indeed the issue. This might take a few days to set up and run.

          Cassandra is mission- and business-critical for us. Moving to 1.2 will take some time, as we will need to set up a test environment, practice migrations, and test. We also use the ByteOrderedPartitioner, which in general concerns me, as it's not the most popular way to use Cassandra and may be a source of issues since it gets pounded on less by the general user community.

          Michael Theroux added a comment -

          To follow up on my previous comment, I performed the test I described. The results were quite incredible. I brought up three nodes that represented one token range and its replicas. These were brought up from very recent snapshots, so some inconsistency was expected. I ran the test twice, with and without the patch, on the same data, and periodically monitored the number of pending compaction tasks. Below, each line shows the time it was monitored and the number of pending compactions at that time.

          Without the Patch:

          start: 10:30
          11:22 - 196
          12:33 - 112
          13:19 - 1558
          14:03 - 1579
          14:48 - 1356
          15:25 - 1181
          16:49 - 752
          17:30 - 657
          17:52 - 548
          18:56 - 202
          19:36 - 29
          01:50 - 0

          With the patch:

          start: 3:47
          4:34 - 1
          4:40 - 1
          4:50 - 32
          5:01 - 209
          5:54 - 1
          6:50 - 3 (all streaming from compaction complete)
          6:54 - Repair complete, no compactions

          Not only was this a very efficient repair from the point of view of the number of compactions, it also completed in a little over 3 hours, which is equally dramatic (validation typically lasts several hours for us).

          Christian Spriegel added a comment -

          Michael Theroux: Thanks for sharing your results. Out of curiosity, might I ask how much data you had on these nodes? I assume it's lots of wide rows using TTLs?

          Michael Theroux added a comment -

          We have a large variety of use cases that have different table characteristics. We do have tables with wide rows, and we do have tables with TTLs, but we don't have tables with wide rows and TTLs together.

          Nodetool ring shows we have 280GB per node, or a little less.

          If I'm understanding the issue, though, all you would need is a large number of tombstones getting compacted and deleted on one node and not the others. The table that has been giving us the most grief uses LeveledCompaction, which guarantees that at most 10% of space will be wasted by obsolete rows. Given that there is no guarantee that the 10% of rows on one node overlaps with the 10% of rows on another node, could this be what causes the massive repairs?

          Christian Spriegel added a comment -

          Michael Theroux: 15 hours for 280 GB sounds bad: that is effectively <2 MB/s throughput (assuming -pr and RF 3), right? Ouch.

          Your understanding is correct. Any tombstone can cause it; it's not just TTLed columns.

          My assumption was that time-series data would be the worst-case scenario, because repair would always stream entire wide rows. Reading your message, I realized that random deletes are probably worse, because they will cause hash differences for more keys, spread across the ring. The Merkle tree's inaccuracy will then make repair stream larger portions of data for each of these mismatches.
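
          To make the granularity point concrete, here is a small hypothetical sketch (LeafRange and its leaf count are made up for illustration, not Cassandra's MerkleTree class): the validation tree has a fixed number of leaves over the node's token range, so a single mismatched key forces the entire sub-range covered by its leaf to be streamed.

          // Hypothetical sketch of leaf granularity: one differing key causes the whole
          // sub-range covered by its Merkle tree leaf to be streamed during repair.
          class LeafRange
          {
              static final int LEAVES = 1 << 15;   // illustrative leaf count only

              // For a token in [0, rangeSize), return the [start, end) bounds of its leaf;
              // everything in those bounds is streamed if the leaf hashes differ.
              static long[] leafBoundsOf(long token, long rangeSize)
              {
                  long leafWidth = Math.max(1, rangeSize / LEAVES);
                  long leaf = token / leafWidth;
                  return new long[] { leaf * leafWidth, Math.min(rangeSize, (leaf + 1) * leafWidth) };
              }
          }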

          In general, shouldn't leveled compaction behave better than size-tiered, because it keeps compaction more up-to-date?

          If you do deletes, did you also look at CASSANDRA-5398?

          Michael Theroux added a comment (edited)

          The long compactions before the fix are, I think, a byproduct of leveled compaction. I've seen a number of people mention this on the users list. Basically, leveled compaction in 1.1 is a single-threaded process, and increasing the compaction throughput doesn't help its rate. Leveled compaction is very slow to compact.

          Leveled compaction should be better than Size Tiered, unless you are doing something like major compactions (which we are, on some tables).

          CASSANDRA-5398 looks interesting. We rolled this fix + 1.1.11 into production this weekend. The last repair was a thing of beauty... finished in under 3 hours, with very little streaming and compaction... as it should be when you have no, or very few, inconsistencies in your data. Given it's running so well, I'll leave well enough alone and not apply 5398.

          We are using RF 3 and the repair was using -pr.

          Michael Theroux added a comment -

          One additional thought: is it possible that LeveledCompaction could make this issue worse because it is more efficient at deleting tombstones? It is more efficient, but it's not 100%, so is the chance that a tombstone was deleted on one node and not on the other two (in the case of RF 3) actually greater than with SizeTiered? I guess it depends on a lot... just a thought.


            People

            • Assignee: Sylvain Lebresne
            • Reporter: Christian Spriegel
            • Reviewer: Jonathan Ellis
            • Votes: 0
            • Watchers: 6
