  1. Apache Cassandra
  2. CASSANDRA-19494

Optimize I/O during table scans


Details

    • Improvement
    • Status: Resolved
    • Normal
    • Resolution: Duplicate

    Description

      During table scans the storage engine reads one chunk at a time.  We'd be much better off performing larger I/O operations into an internal buffer: that means fewer I/O operations and fewer system calls.

      For example, doing a scan against this table:

      CREATE TABLE easy_cass_stress.keyvalue (
          key text PRIMARY KEY,
          value text
      ) WITH additional_write_policy = '99p'
          AND allow_auto_snapshot = true
          AND bloom_filter_fp_chance = 0.01
          AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
          AND cdc = false
          AND comment = ''
          AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
          AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
          AND memtable = 'default'
          AND crc_check_chance = 1.0
          AND default_time_to_live = 0
          AND extensions = {}
          AND gc_grace_seconds = 864000
          AND incremental_backups = true
          AND max_index_interval = 2048
          AND memtable_flush_period_in_ms = 0
          AND min_index_interval = 128
          AND read_repair = 'BLOCKING'
          AND speculative_retry = '99p';

      I see the following I/O activity (a sample only; see the attached reads.txt for a full accounting of all reads):

      TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
      16:59:23 ReadStage-2    2523   R 15051   0           0.02 nb-6-big-Data.db
      16:59:23 ReadStage-2    2523   R 15049   0           0.01 nb-8-big-Data.db
      16:59:23 ReadStage-2    2523   R 15025   0           0.01 nb-5-big-Data.db
      16:59:23 ReadStage-2    2523   R 15064   0           0.01 nb-7-big-Data.db
      16:59:25 ReadStage-2    2523   R 15051   0           0.01 nb-6-big-Data.db
      16:59:25 ReadStage-2    2523   R 15049   0           0.01 nb-8-big-Data.db
      16:59:25 ReadStage-2    2523   R 15025   0           0.01 nb-5-big-Data.db
      16:59:25 ReadStage-2    2523   R 15064   0           0.00 nb-7-big-Data.db
      16:59:25 ReadStage-2    2523   R 15064   14          0.01 nb-5-big-Data.db
      16:59:25 ReadStage-2    2523   R 15051   0           0.01 nb-6-big-Data.db
      16:59:25 ReadStage-2    2523   R 15049   0           0.00 nb-8-big-Data.db
      16:59:25 ReadStage-2    2523   R 15064   14          0.00 nb-5-big-Data.db
      16:59:25 ReadStage-2    2523   R 15064   0           0.00 nb-7-big-Data.db
      16:59:25 ReadStage-2    2523   R 15012   29          0.01 nb-5-big-Data.db

      A sample of our off-CPU time, taken after dropping caches, looks like this:

      cpudist -O -p $(cassandra-pid) -m 1 30
      
           msecs               : count     distribution
               0 -> 1          : 5259     |****************************************|
               2 -> 3          : 486      |***                                     |
               4 -> 7          : 0        |                                        |
               8 -> 15         : 1        |                                        |
              16 -> 31         : 0        |                                        |
              32 -> 63         : 29       |                                        |
              64 -> 127        : 77       |                                        |
             128 -> 255        : 4        |                                        |
             256 -> 511        : 6        |                                        |
             512 -> 1023       : 6        |                                        |
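
To put rough numbers on the read trace above, the sample rows can be summarized with a few lines of Python. This is only a sketch over the 14 rows shown in this ticket, not the full reads.txt attachment, and the `summarize` helper is a hypothetical name introduced for illustration. Every read in the sample comes in under the table's 16 KiB chunk length, which is consistent with one compressed chunk per read call (an inference from the sizes, not something the tracing tool reports).

```python
# Rows copied from the trace sample above.
# Columns: TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
TRACE = """\
16:59:23 ReadStage-2    2523   R 15051   0           0.02 nb-6-big-Data.db
16:59:23 ReadStage-2    2523   R 15049   0           0.01 nb-8-big-Data.db
16:59:23 ReadStage-2    2523   R 15025   0           0.01 nb-5-big-Data.db
16:59:23 ReadStage-2    2523   R 15064   0           0.01 nb-7-big-Data.db
16:59:25 ReadStage-2    2523   R 15051   0           0.01 nb-6-big-Data.db
16:59:25 ReadStage-2    2523   R 15049   0           0.01 nb-8-big-Data.db
16:59:25 ReadStage-2    2523   R 15025   0           0.01 nb-5-big-Data.db
16:59:25 ReadStage-2    2523   R 15064   0           0.00 nb-7-big-Data.db
16:59:25 ReadStage-2    2523   R 15064   14          0.01 nb-5-big-Data.db
16:59:25 ReadStage-2    2523   R 15051   0           0.01 nb-6-big-Data.db
16:59:25 ReadStage-2    2523   R 15049   0           0.00 nb-8-big-Data.db
16:59:25 ReadStage-2    2523   R 15064   14          0.00 nb-5-big-Data.db
16:59:25 ReadStage-2    2523   R 15064   0           0.00 nb-7-big-Data.db
16:59:25 ReadStage-2    2523   R 15012   29          0.01 nb-5-big-Data.db
"""

# chunk_length_in_kb = 16 in the table definition above
CHUNK_LENGTH = 16 * 1024


def summarize(trace):
    """Return the read count, total bytes, and largest read in a trace sample."""
    # The BYTES column is the fifth whitespace-separated field.
    sizes = [int(line.split()[4]) for line in trace.splitlines() if line.strip()]
    return {
        "reads": len(sizes),
        "total_bytes": sum(sizes),
        "largest_read": max(sizes),
    }


summary = summarize(TRACE)
print(summary)
print("all reads below chunk length:", summary["largest_read"] < CHUNK_LENGTH)
```

Running the same helper over the full attachment would give the real read count for the scan; the sample alone only shows the shape of the access pattern.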

      We pay a pretty serious throughput penalty for issuing this many small reads.

      We should be able to leverage the work in CASSANDRA-15452 for this.
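
As a rough illustration of the idea in this ticket, here is a minimal Python sketch of a read-ahead buffer that serves chunk-sized reads from one larger read. It is not Cassandra code and does not claim to reflect the design in CASSANDRA-15452. `CountingSource` and `ReadAheadReader` are hypothetical names, the 256 KiB buffer size is an arbitrary choice, and the sketch assumes fixed-size, uncompressed chunks in an in-memory byte string, whereas real compressed chunks vary in size.

```python
CHUNK = 16 * 1024  # matches chunk_length_in_kb = 16 in the table above


class CountingSource:
    """Stand-in for a data file; counts the reads that would reach the OS."""

    def __init__(self, data):
        self.data = data
        self.read_calls = 0

    def pread(self, offset, length):
        self.read_calls += 1
        return self.data[offset:offset + length]


class ReadAheadReader:
    """Serves small reads from a larger internal buffer (hypothetical helper)."""

    def __init__(self, source, buffer_size):
        self.source = source
        self.buffer_size = buffer_size
        self.buf_start = 0
        self.buf = b""

    def read(self, offset, length):
        end = offset + length
        in_buffer = (self.buf_start <= offset
                     and end <= self.buf_start + len(self.buf))
        if not in_buffer:
            # One large read replaces many chunk-sized ones.
            self.buf_start = offset
            self.buf = self.source.pread(offset, max(length, self.buffer_size))
        rel = offset - self.buf_start
        return self.buf[rel:rel + length]


def scan(read, size):
    """Sequentially read `size` bytes one chunk at a time, like a table scan."""
    out = bytearray()
    for offset in range(0, size, CHUNK):
        out += read(offset, CHUNK)
    return bytes(out)


# 64 chunks of 16 KiB = 1 MiB of sample data
data = bytes(range(256)) * (64 * CHUNK // 256)

direct = CountingSource(data)
assert scan(direct.pread, len(data)) == data

buffered_source = CountingSource(data)
reader = ReadAheadReader(buffered_source, buffer_size=256 * 1024)
assert scan(reader.read, len(data)) == data

print(direct.read_calls, buffered_source.read_calls)  # prints "64 4"
```

The scan returns the same bytes either way; only the number of reads that reach the underlying source changes, which is the effect this ticket is after.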

      Attachments

        1. reads.txt
          38 kB
          Jon Haddad

            People

              Assignee: Unassigned
              Reporter: Jon Haddad (rustyrazorblade)
