Details
Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Duplicate
Description
The storage engine reads chunk by chunk during table scans. We would be much better off reading larger blocks into an internal buffer, so that each scan issues fewer I/O operations and avoids excessive system calls (a rough sketch of that approach is at the end of this description).
For example, doing a scan against this table:
CREATE TABLE easy_cass_stress.keyvalue (
    key text PRIMARY KEY,
    value text
) WITH additional_write_policy = '99p'
    AND allow_auto_snapshot = true
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND incremental_backups = true
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';
I see the following I/O activity (sample only; see the attachment for a full accounting of all reads):
TIME      COMM         PID   T  BYTES  OFF_KB  LAT(ms)  FILENAME
16:59:23  ReadStage-2  2523  R  15051  0       0.02     nb-6-big-Data.db
16:59:23  ReadStage-2  2523  R  15049  0       0.01     nb-8-big-Data.db
16:59:23  ReadStage-2  2523  R  15025  0       0.01     nb-5-big-Data.db
16:59:23  ReadStage-2  2523  R  15064  0       0.01     nb-7-big-Data.db
16:59:25  ReadStage-2  2523  R  15051  0       0.01     nb-6-big-Data.db
16:59:25  ReadStage-2  2523  R  15049  0       0.01     nb-8-big-Data.db
16:59:25  ReadStage-2  2523  R  15025  0       0.01     nb-5-big-Data.db
16:59:25  ReadStage-2  2523  R  15064  0       0.00     nb-7-big-Data.db
16:59:25  ReadStage-2  2523  R  15064  14      0.01     nb-5-big-Data.db
16:59:25  ReadStage-2  2523  R  15051  0       0.01     nb-6-big-Data.db
16:59:25  ReadStage-2  2523  R  15049  0       0.00     nb-8-big-Data.db
16:59:25  ReadStage-2  2523  R  15064  14      0.00     nb-5-big-Data.db
16:59:25  ReadStage-2  2523  R  15064  0       0.00     nb-7-big-Data.db
16:59:25  ReadStage-2  2523  R  15012  29      0.01     nb-5-big-Data.db
with a sample of our off-CPU time looking like this (after dropping caches):
cpudist -O -p $(cassandra-pid) -m 1 30

     msecs       : count    distribution
     0 -> 1      : 5259    |****************************************|
     2 -> 3      : 486     |***                                     |
     4 -> 7      : 0       |                                        |
     8 -> 15     : 1       |                                        |
     16 -> 31    : 0       |                                        |
     32 -> 63    : 29      |                                        |
     64 -> 127   : 77      |                                        |
     128 -> 255  : 4       |                                        |
     256 -> 511  : 6       |                                        |
     512 -> 1023 : 6       |                                        |
We pay a pretty serious throughput penalty for excessive I/O.
We should be able to leverage the work in CASSANDRA-15452 for this.
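For the record, here is a minimal sketch of the buffered-read idea mentioned at the top of the description. It is not the CASSANDRA-15452 implementation, and the class and method names are hypothetical; the point is only that one large positional read into an internal buffer can satisfy many chunk-sized requests without additional system calls.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Illustrative read-ahead wrapper (hypothetical names, not a Cassandra API):
 * issues large sequential reads into an internal buffer and serves smaller
 * chunk requests from it, so a table scan makes far fewer system calls than
 * one read() per ~16 KiB compressed chunk.
 */
public final class BufferedSequentialReader implements AutoCloseable
{
    private final FileChannel channel;
    private final ByteBuffer buffer;   // internal read-ahead buffer, e.g. 256 KiB
    private long bufferStart = 0;      // file offset corresponding to buffer position 0

    public BufferedSequentialReader(Path file, int readAheadBytes) throws IOException
    {
        channel = FileChannel.open(file, StandardOpenOption.READ);
        buffer = ByteBuffer.allocateDirect(readAheadBytes);
        buffer.limit(0);               // start with an empty buffer
    }

    /** Copy bytes starting at {@code offset} into {@code dst} until it is full, refilling as needed. */
    public void read(long offset, ByteBuffer dst) throws IOException
    {
        while (dst.hasRemaining())
        {
            if (offset < bufferStart || offset >= bufferStart + buffer.limit())
                fill(offset);          // requested range not buffered: one large read()

            int pos = (int) (offset - bufferStart);
            int len = Math.min(dst.remaining(), buffer.limit() - pos);
            ByteBuffer slice = buffer.duplicate();
            slice.position(pos).limit(pos + len);
            dst.put(slice);
            offset += len;
        }
    }

    private void fill(long offset) throws IOException
    {
        buffer.clear();
        int read = channel.read(buffer, offset);   // single positional read of up to the full buffer
        if (read <= 0)
            throw new IOException("EOF at offset " + offset);
        buffer.flip();
        bufferStart = offset;
    }

    @Override
    public void close() throws IOException
    {
        channel.close();
    }
}

With, say, a 256 KiB read-ahead buffer and the 16 KiB compressed chunks from the schema above, a sequential scan would issue roughly one system call per 16 chunks instead of one per chunk.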
Attachments
Issue Links
- duplicates CASSANDRA-15452: Improve disk access patterns during compaction (big format) (Review In Progress)