Cassandra / CASSANDRA-3635

Throttle validation separately from other compaction

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Fix Version/s: None
    • Component/s: Core
    • Labels:

      Description

      Validation compaction is fairly resource intensive. It is possible to throttle it together with other compaction, but there are cases where you really want to throttle it rather aggressively without necessarily throttling minor compactions that much. The goal is to (optionally) allow setting a separate throttling value for validation.

      PS: I'm not pretending this will solve every repair problem or anything.
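The behaviour being proposed can be sketched roughly as follows. This is a minimal illustration only, assuming hypothetical names (get_throttle_mb_per_sec, the "VALIDATION" operation type, and the parameter names are not Cassandra's actual API):

```python
# Hypothetical sketch of the proposal: a validation compaction picks up its
# own throttle when one is configured, and otherwise falls back to the shared
# compaction throttle. Names and defaults are illustrative, not Cassandra's.

def get_throttle_mb_per_sec(op_type, compaction_mb=16, validation_mb=None):
    """Return the throughput cap (MB/s) to apply to a compaction task."""
    if op_type == "VALIDATION" and validation_mb is not None:
        return validation_mb  # aggressive, validation-only limit
    return compaction_mb      # default: one shared limit for everything

# Minor compactions keep the normal limit...
assert get_throttle_mb_per_sec("MINOR", 16, 4) == 16
# ...while validation can be throttled much harder.
assert get_throttle_mb_per_sec("VALIDATION", 16, 4) == 4
```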

        Activity

        Sylvain Lebresne added a comment -

        Patch attached against 0.8. It is a fairly simple patch and repair is actually worse in 0.8 than in 1.0, so I think it's not unreasonable to put this in 0.8. That is unless someone feels very strongly against it, of course.

        Jonathan Ellis added a comment -

        I don't think we should put this in 0.8. The repair problems there are a lot deeper than this. I'm fine with posting a backport patch if people want to run a custom build with it, but this shouldn't go in anything earlier than 1.0. (I'd prefer 1.1 TBH.)

        Since compaction throughput does not include validation anymore, I'd prefer to default to something like 12/4 instead of effectively increasing the impact of compaction + repair out of the box.

        Jonathan Ellis added a comment -

        (+1 otherwise)

        Sylvain Lebresne added a comment -

        Since compaction throughput does not include validation anymore, I'd prefer to default to something like 12/4 instead of effectively increasing the impact of compaction + repair out of the box.

        By default, if validation_throughput_mb_per_sec is not set, validations are still throttled together with all other compaction, so the default shouldn't change anything. The rationale was to 1) ease upgrades and 2) simplify configuration for people who don't care about this. But now that I think of it, the fact that compaction_throughput_mb_per_sec affects validation compaction only if the validation-specific setting is not set is probably a bit confusing. Maybe the more natural way to do this is to have compaction_throughput_mb_per_sec be the total maximum throughput for all compaction, and have 0 <= validation_throughput_mb_per_sec <= compaction_throughput_mb_per_sec be a way to further throttle validation compaction.
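The alternative semantics suggested here can be sketched as follows (a minimal illustration under those assumptions; the function and parameter names are hypothetical):

```python
# Sketch of the alternative semantics: compaction_throughput_mb_per_sec is a
# total budget for ALL compaction, and the validation setting, clamped to
# 0 <= v <= total, only throttles validation further. Illustrative names only.

def effective_limits(compaction_total_mb, validation_mb=None):
    if validation_mb is None:
        validation_mb = compaction_total_mb  # default: no extra throttling
    validation_mb = max(0, min(validation_mb, compaction_total_mb))
    return compaction_total_mb, validation_mb

assert effective_limits(16) == (16, 16)      # default changes nothing
assert effective_limits(16, 4) == (16, 4)    # validation throttled harder
assert effective_limits(16, 99) == (16, 16)  # clamped to the total budget
```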

        Jonathan Ellis added a comment -

        Taking a step back, I'm not sure I see the benefit here. If we're okay with X MB/s of i/o going on, doesn't that disrupt reads just as much whether that comes from repair validation or "ordinary" compaction?

        Sylvain Lebresne added a comment -

        I guess part of the idea is that validation is a bit CPU intensive (due to the SHA-256 hash it computes), so this allows limiting that too without getting in the way of other compaction. It also allows giving more room to ordinary compactions, so that they complete earlier, which will impact reads (while having validation finish quickly is not necessarily as important).

        Jonathan Ellis added a comment -

        If you're i/o bound under size-tiered compaction you're kind of screwed since it does such a poor job of actually bucketing the same rows together.

        I think we should get some feedback of the "here's what my workload looks like and this diminishes my repair pain" nature before committing this. Again, I'm fine with posting a 0.8 version of the patch if that helps.

        Sylvain Lebresne added a comment -

        I think we should get some feedback of the "here's what my workload looks like and this diminishes my repair pain" nature before committing this.

        I'm totally fine with that.

        Again, I'm fine with posting a 0.8 version of the patch if that helps.

        The currently attached patch is against 0.8.

        Vijay added a comment -

        I think it would be much better if we could prioritize long-running compaction vs. normal compaction. Let's say we have:

        10MB compaction limit
        2MB validation compaction limit

        2MB is the limit for validation for a while, but when a normal compaction kicks in we might want to hold the validation, let the compaction complete (because it affects read performance), and continue with the validation compaction after that. By doing this we can set something like:

        12MB compaction limit
        6MB validation compaction limit

        and still be within the HDD limit of 12MB.
        The good thing about normal compaction is that it is spread out, and not all the nodes are involved in it at once.

        I am starting to think that we could do repairs one by one for a range (within a region), so the traffic doesn't get stuck waiting for the I/O. Hope it makes sense.
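The prioritization scheme described above could look roughly like this. A minimal sketch only, with illustrative numbers and names (allocate, HDD_LIMIT_MB are not real Cassandra settings):

```python
# Sketch of the prioritization idea: while a normal compaction is running it
# gets the full budget and validation is held; validation resumes at its own
# (lower) limit once normal compaction finishes. Numbers are illustrative.

HDD_LIMIT_MB = 12

def allocate(normal_compaction_running, compaction_mb=12, validation_mb=6):
    if normal_compaction_running:
        return {"compaction": compaction_mb, "validation": 0}  # hold validation
    return {"compaction": 0, "validation": validation_mb}      # resume it

# Either way, combined I/O stays within the disk's 12 MB/s budget.
assert sum(allocate(True).values()) <= HDD_LIMIT_MB
assert sum(allocate(False).values()) <= HDD_LIMIT_MB
```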

        Jonathan Ellis added a comment -

        I am starting to think that we can do repairs one by one for a range (within a region)

        You mean if you have replicas A B C, comparing A and B before comparing A and C? The downside there is you now have to validate twice, or they will be too out of sync.

        Vijay added a comment -

        Nope, I think we can create a tree independently on the nodes and then compare them.

        Let's say we create a tree on A first; after completion, we can create a tree on B and then on C (we have to sync on time, maybe flush at the time the repair was requested or something like that).

        Once we have all 3 trees we can compare them and transfer what is required. Once we have all the trees we can exchange them and then start the real streaming if needed. That way we don't bring the whole range down or make it hot.
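The sequential scheme can be sketched as follows. This is purely illustrative: the single hash stands in for a full Merkle tree, and sequential_repair is a hypothetical name, not anything in Cassandra:

```python
# Sketch of the sequential idea: snapshot all replicas at the same instant,
# compute each replica's tree one node at a time (A, then B, then C) instead
# of in parallel, and only compare/stream at the end.

import hashlib

def tree_hash(rows):
    # Stand-in for a Merkle tree: one hash over the snapshotted rows.
    h = hashlib.sha256()
    for key, value in sorted(rows.items()):
        h.update(f"{key}={value}".encode())
    return h.hexdigest()

def sequential_repair(snapshots):
    # Trees are computed sequentially per replica, then compared at the end.
    trees = {node: tree_hash(rows) for node, rows in snapshots.items()}
    reference = trees["A"]
    # Replicas whose tree differs from A's would need streaming.
    return [node for node, t in trees.items() if t != reference]

snap = {"A": {"k1": "v1"}, "B": {"k1": "v1"}, "C": {"k1": "stale"}}
assert sequential_repair(snap) == ["C"]
```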

        Sylvain Lebresne added a comment -

        Lets say we create a tree on A first after completion, we can create a tree on B and then on C

        In theory we kind of could. We do need to make sure trees are computed on roughly the same data on all nodes, so we'd need to keep the flush at the same time, but then we don't have to start the computation on all nodes right away. However, for that to work we would need to keep references to the sstables after the initial flush, which adds its own set of complications: if for some reason a node never receives its 'you can start computing your tree' message, it will keep some sstables around forever. We can add a number of protections so that this never happens, but it's still potentially a very nasty effect.

        In any case, probably not a discussion related to this ticket.

        Jonathan Ellis added a comment -

        I think we should get some feedback of the "here's what my workload looks like and this diminishes my repair pain" nature before committing this

        Resolving as incomplete in the meantime.

        For the record, I think incremental repair as proposed in CASSANDRA-3912 is a more promising approach overall.


          People

          • Assignee:
            Unassigned
            Reporter:
            Sylvain Lebresne
          • Votes:
            0
            Watchers:
            0
