There is a clear need to support rate limiting of all background I/O (e.g., compaction, repair). In some cases background I/O is naturally rate limited as a result of being CPU bottlenecked. But in all cases where the CPU is not the bottleneck, background streaming I/O is almost guaranteed to be detrimental to the latency and throughput of regular live traffic (reads), barring a very smart RAID controller or I/O subsystem that happens to cater extremely well to the use case.
Ways in which live traffic is negatively affected by background I/O include:
- Indirectly by page cache eviction (see e.g.
- Reads are directly detrimental, for the usual reasons, when not otherwise limited: a continuous stream of large read requests competes with latency-sensitive live traffic (mostly seek bound). Mixing seek-bound, latency-critical I/O with bulk streaming is a classic no-no for I/O scheduling.
- Writes are directly detrimental in a similar fashion.
- But in particular, writes are more difficult still: caching tends to amplify the effects because, lacking any kind of fsync() or direct I/O, the operating system and/or RAID controller tends to defer writes when possible. This often leads to a very sudden throttling of the application when caches are filled, at which point there is potentially a huge backlog of data to write.
- This may evict a lot of data from page cache since dirty buffers cannot be evicted prior to being flushed out (though CASSANDRA-1470 and related will hopefully help here).
- In particular, one major reason why battery-backed RAID controllers are great is that they have the capability to "eat" storms of writes very quickly and schedule them pretty efficiently with respect to a concurrent continuous stream of reads. But this ability is defeated if we just throw data at the controller until its cache is entirely full. A rate-limited approach instead means that data can be thrown at said RAID controller at a reasonable pace, and it can be allowed to do its job of limiting the impact of those writes on reads.
I propose a mechanism whereby all such background reads are rate limited in terms of MB/sec throughput. There would be:
- A configuration option to state the target rate (probably a global, until there is support for per-cf sstable placement)
- A configuration option to state the sampling granularity. The granularity would have to be small enough for rate limiting to be effective (i.e., the amount of I/O generated in between each sample must be reasonably small) while large enough to not be expensive (neither in terms of gettimeofday()-type overhead, nor in terms of causing smaller writes so that would-be streaming operations become seek bound). There would likely be a recommended value on the order of, say, 5 MB, with a recommendation to multiply that by the number of disks in the underlying device (5 MB assumes classic mechanical disks).
Because of the coarse granularity (= infrequent synchronization), there should not be a significant overhead associated with maintaining a shared global rate limiter for the Cassandra instance.
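To make the proposal concrete, here is a minimal sketch of what such a shared limiter could look like. It is only an illustration under stated assumptions: the names `BackgroundIoThrottle`, `Stream`, `reserve` and `onBytes` are hypothetical and not existing Cassandra APIs, and the scheme shown (each stream counts bytes locally and only touches the shared state once per granularity's worth of I/O) is one possible way to realize the two options above, not necessarily the final design.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; all names here are illustrative, not Cassandra APIs.
class BackgroundIoThrottle {
    private final double targetBytesPerSec; // the "target rate" option
    private final long granularityBytes;    // the "sampling granularity" option
    private long nextFreeNanos;             // guarded by "this"

    BackgroundIoThrottle(long targetBytesPerSec, long granularityBytes) {
        if (targetBytesPerSec <= 0 || granularityBytes <= 0)
            throw new IllegalArgumentException("rate and granularity must be positive");
        this.targetBytesPerSec = targetBytesPerSec;
        this.granularityBytes = granularityBytes;
        this.nextFreeNanos = System.nanoTime();
    }

    /**
     * Books `bytes` of I/O against the global budget and returns how many
     * nanoseconds the caller should wait before continuing. Idle time is not
     * banked, so a quiet period does not allow a later burst.
     */
    synchronized long reserve(long bytes, long nowNanos) {
        long costNanos = (long) (bytes * 1e9 / targetBytesPerSec);
        // Subtraction-based comparison, as recommended for System.nanoTime() values.
        long start = (nextFreeNanos - nowNanos > 0) ? nextFreeNanos : nowNanos;
        nextFreeNanos = start + costNanos;
        return start - nowNanos;
    }

    /** One per background task (e.g., one compaction); not itself thread-safe. */
    final class Stream {
        private long pendingBytes;

        /** Call after each chunk of background I/O with the number of bytes moved. */
        void onBytes(long n) throws InterruptedException {
            pendingBytes += n;
            if (pendingBytes < granularityBytes)
                return; // common case: no clock read, no lock
            long waitNanos = reserve(pendingBytes, System.nanoTime());
            pendingBytes = 0;
            if (waitNanos > 0)
                TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }

    Stream newStream() {
        return new Stream();
    }
}
```

As a worked example with purely illustrative numbers: on a 4-disk device the recommendation above gives a granularity of 4 x 5 MB = 20 MB, and with a target rate of 10 MB/sec a single stream would then read the clock and take the shared lock only about once every 2 seconds.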