Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None

      Description

      There is a clear need to support rate limiting of all background I/O (e.g., compaction, repair). In some cases background I/O is naturally rate limited as a result of being CPU bottlenecked, but whenever the CPU is not the bottleneck, background streaming I/O is almost guaranteed (barring a very smart RAID controller or an I/O subsystem that happens to cater extremely well to this use case) to be detrimental to the latency and throughput of regular live traffic (reads).

      Ways in which live traffic is negatively affected by background I/O include:

      • Indirectly by page cache eviction (see e.g. CASSANDRA-1470).
      • Background reads are directly detrimental when not otherwise limited, for the usual reasons: a continuous stream of large read requests competes with latency-sensitive live traffic (which is mostly seek bound). Mixing seek-bound, latency-critical traffic with bulk streaming is a classic no-no for I/O scheduling.
      • Writes are directly detrimental in a similar fashion.
      • But writes in particular are more difficult still: caching tends to amplify the problem because, lacking any kind of fsync() or direct I/O, the operating system and/or RAID controller defers writes whenever possible. This often leads to very sudden throttling of the application once caches fill up, at which point there is potentially a huge backlog of data to write.
        • This may evict a lot of data from the page cache, since dirty buffers cannot be evicted prior to being flushed out (though CASSANDRA-1470 and related work will hopefully help here).
        • In particular, one major reason why battery-backed RAID controllers are great is that they can absorb storms of writes very quickly and schedule them fairly efficiently alongside a concurrent, continuous stream of reads. This ability is defeated if we simply throw data at the controller until it is entirely full. A rate-limited approach instead means data reaches the controller at a reasonable pace, allowing it to do its job of limiting the impact of those writes on reads.

      I propose a mechanism whereby all such background I/O is rate limited in terms of MB/sec throughput. There would be:

      • A configuration option to state the target rate (probably a global setting, until there is support for per-CF sstable placement).
      • A configuration option to state the sampling granularity. The granularity must be small enough for rate limiting to be effective (i.e., the amount of I/O generated between samples must be reasonably small), yet large enough not to be expensive (neither in terms of gettimeofday()-style overhead, nor in terms of causing smaller writes that turn would-be streaming operations seek bound). There would likely be a recommended value on the order of, say, 5 MB, with a recommendation to multiply that by the number of disks in the underlying device (5 MB assumes classic mechanical disks).

      Because of the coarse granularity (i.e., infrequent synchronization), there should not be significant overhead associated with maintaining a shared global rate limiter for the Cassandra instance.
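
      To make the mechanism concrete, a minimal sketch of such a shared limiter is shown below. It is illustrative only: the names (BackgroundIOThrottle, targetBytesPerSec, quantumBytes) are hypothetical rather than an existing Cassandra API, and fairness between concurrent background tasks is deliberately ignored.

      // Hypothetical sketch, not actual Cassandra code. A single shared instance
      // would be used by all background I/O (compaction, repair, ...). Callers
      // report bytes as they go; the limiter only does clock/rate work once per
      // quantum, so the per-call overhead stays negligible.
      public final class BackgroundIOThrottle
      {
          private final long targetBytesPerSec; // target rate, in bytes per second
          private final long quantumBytes;      // sampling granularity, e.g. 5 MB

          private long pendingBytes = 0;        // bytes reported since the last check
          private long windowStartNanos = System.nanoTime();

          public BackgroundIOThrottle(long targetBytesPerSec, long quantumBytes)
          {
              this.targetBytesPerSec = targetBytesPerSec;
              this.quantumBytes = quantumBytes;
          }

          // Report background I/O; sleeps when we are ahead of the target rate.
          // Holding the monitor while sleeping also stalls other background tasks,
          // which is intended: the limit is global, with no fairness guarantees.
          public synchronized void acquire(long bytes) throws InterruptedException
          {
              pendingBytes += bytes;
              if (pendingBytes < quantumBytes)
                  return; // cheap path: just accumulate until a full quantum

              long elapsedNanos = System.nanoTime() - windowStartNanos;
              long targetNanos = pendingBytes * 1000000000L / targetBytesPerSec;
              long aheadNanos = targetNanos - elapsedNanos;
              if (aheadNanos > 0)
                  Thread.sleep(aheadNanos / 1000000L, (int) (aheadNanos % 1000000L));

              pendingBytes = 0;
              windowStartNanos = System.nanoTime();
          }
      }

      With a 5 MB quantum and, say, a 16 MB/s target, the limiter only takes a clock sample and (potentially) sleeps roughly three times per second, so the shared lock never sits on a hot path.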

        Issue Links

          Activity

          Jonathan Ellis added a comment -

          closing in favor of CASSANDRA-2156

          Stu Hood added a comment -

          Oops, couldn't find this one before... CASSANDRA-2156 is sort of a dupe.

          Peter Schuller added a comment -

          Apologies, I still haven't found enough time to put together a useful/interesting patch. The limited time I've squeezed in for Cassandra has gone to other tickets. I will try to have a draft patch soon that is at least ready for some interface/design input, even if not exhaustively tested.

          Jonathan Ellis added a comment -

          How is this looking, Peter?

          Peter Schuller added a comment -

          (First, haven't done further work yet because I'm away traveling and not really doing development.)

          Jake: Thanks. However, I'm pretty skeptical, as I/O niceness only gives a very coarse way of specifying what you want. So even if it worked beautifully in some particular case, it won't in others, and there is no good way to control it AFAIK.

          For example, the very first test I did (writing at a fixed speed and a fixed chunk size concurrently with seek-bound small reads) failed miserably by completely starving the writes (and this was without ionice), until I switched away from cfq to noop or deadline, because cfq refused to actually submit I/O requests to the device and let it do its own scheduling based on better information (more on that in a future comment). The io nice support is specific to cfq, by the way.

          I don't want to talk too many specifics yet, because I want to do some more testing and try a bit harder to make cfq do what I want before I start making claims. But I think that, in general, rate limiting I/O in such a way that you get sufficient throughput while not having too adverse an effect on foreground reads is going to take some runtime tuning depending on both workload and hardware (e.g., a lone disk vs. a 6-disk RAID10 are entirely different matters). I think that simply telling the kernel to de-prioritize the compaction workload might work well in some very specific situations (exactly the right kernel version, I/O scheduler choice/parameters, workload and underlying storage device), but not in general.

          More to come. Hopefully with some Python code + sysbench command lines for easy testing by others on differing hardware setups. (I have not yet tested with a real rate-limited Cassandra, but did test with sysbench for reads and a Python writer doing chunk-sized I/O with fsync(). Tests were done on RAID5/RAID10 and with xfs and ext4 (not all permutations). While file system choice matters somewhat, all results instantly became useless once I realized the I/O scheduler was orders of magnitude more important.)

          T Jake Luciani added a comment -

          Peter, I need to dig into this, but I think it could also be done via the http://linux.die.net/man/2/ioprio_set call in Linux for the compaction thread. Obviously not as portable, but it could be a quick win.
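
          For reference, a rough sketch of what that call could look like from Java follows. This is a heavily hedged illustration rather than a patch: it assumes JNA is on the classpath, the syscall number 251 is specific to Linux/x86_64, the effect is specific to the cfq scheduler, and it must run on the compaction thread itself (who == 0 means "the calling thread").

          // Illustrative only; assumes JNA, Linux/x86_64, cfq. Constants are taken
          // from include/linux/ioprio.h and the x86_64 syscall table.
          import com.sun.jna.Library;
          import com.sun.jna.Native;

          public final class CompactionIoPriority
          {
              private interface CLib extends Library
              {
                  CLib INSTANCE = (CLib) Native.loadLibrary("c", CLib.class);

                  // variadic libc syscall(2) wrapper
                  int syscall(int number, Object... args);
              }

              private static final int NR_IOPRIO_SET      = 251; // x86_64-specific
              private static final int IOPRIO_WHO_PROCESS = 1;
              private static final int IOPRIO_CLASS_IDLE  = 3;
              private static final int IOPRIO_CLASS_SHIFT = 13;

              // Run this on the compaction thread: who == 0 targets the calling thread.
              public static int setCallingThreadIdle()
              {
                  int ioprio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT;
                  return CLib.INSTANCE.syscall(NR_IOPRIO_SET, IOPRIO_WHO_PROCESS, 0, ioprio);
              }
          }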

          Jonathan Ellis added a comment -

          Thanks Peter, looking forward to seeing the patch.

          (I do think we should be able to come up with a reasonable sampling granularity and not have to proliferate options for that part, but the overall plan sounds fine.)

          Peter Schuller added a comment -

          Assigning to myself. I've begun a very simple implementation with a bytes-per-second target and a fixed quantum, without any attempt at fairness between concurrent I/O, and without any attempt to achieve an average rate equal to the target maximum rate.
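
          As a usage sketch of that approach (fixed quantum, bytes-per-second target, no fairness), a background copy loop could look like the following. It reuses the hypothetical BackgroundIOThrottle sketched in the description above; the class and method names here are illustrative only and not taken from the actual work in progress.

          import java.io.IOException;
          import java.io.InputStream;
          import java.io.OutputStream;

          public final class ThrottledCopy
          {
              // Hypothetical compaction/repair-style copy loop. acquire() is called per
              // chunk but only sleeps once a full quantum (e.g. 5 MB) has accumulated,
              // so the per-chunk overhead is essentially a field increment.
              static void copy(InputStream in, OutputStream out, BackgroundIOThrottle throttle)
                      throws IOException, InterruptedException
              {
                  byte[] chunk = new byte[64 * 1024];
                  int read;
                  while ((read = in.read(chunk)) != -1)
                  {
                      out.write(chunk, 0, read);
                      throttle.acquire(read); // may sleep to hold the average near the target
                  }
              }
          }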


            People

            • Assignee: Unassigned
            • Reporter: Peter Schuller
            • Votes: 0
            • Watchers: 4
