Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-19607

Compaction double reads every chunk

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Normal
    • Resolution: Duplicate
    • None
    • None
    • None
    • All
    • None

    Description

      I was taking a look at the I/O operations performed during compaction and noticed it looks like we're double reading every chunk of the filesystem, meaning we perform 2x the system calls that we need to. The second read will come from page cache, but this isn't free, so we should try to avoid it. Here's the steps to reproduce.

      First, wrote a small amount of data to a table and flush. I have a single SSTable that's 151KB here:

      -rw-r--r-- 1 cassandra  127 May  1 21:00 da-8-bti-CompressionInfo.db
      -rw-r--r-- 1 cassandra 151K May  1 21:00 da-8-bti-Data.db
      -rw-r--r-- 1 cassandra   10 May  1 21:00 da-8-bti-Digest.crc32
      -rw-r--r-- 1 cassandra  136 May  1 21:00 da-8-bti-Filter.db
      -rw-r--r-- 1 cassandra  829 May  1 21:00 da-8-bti-Partitions.db
      -rw-r--r-- 1 cassandra    0 May  1 21:00 da-8-bti-Rows.db
      -rw-r--r-- 1 cassandra 5.0K May  1 21:00 da-8-bti-Statistics.db
      -rw-r--r-- 1 cassandra   94 May  1 21:00 da-8-bti-TOC.txt
      

      Then watch all reads by doing the following:

      $ xfsslower -p $PID 0
      

      Next, start a event profiling session on ChannelProxy.read by doing the following with async-profiler:

      $ asprof -e org.apache.cassandra.io.util.ChannelProxy.read -d 10 $(cassandra-pid) -f /mnt/cassandra/artifacts/channelproxy-read.txt
      

      Then I started a compaction on the table which lasted roughly a second.

      I took the log of filesystem operations and pulled out all the accesses to `da-8-bti-Data.db` using ripgrep:

      $ rg da-8-bti-Data.db artifacts/cassandra1/xfsslower-da8.txt
      

      The headings are:

      TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
      

      The data:

      27:21:07:05 CompactionExec 30550  R 16387   0           0.02 da-8-bti-Data.db
      28:21:07:05 CompactionExec 30550  R 16387   0           0.00 da-8-bti-Data.db
      40:21:07:05 CompactionExec 30550  R 16436   16          0.01 da-8-bti-Data.db
      41:21:07:05 CompactionExec 30550  R 16436   16          0.01 da-8-bti-Data.db
      44:21:07:05 CompactionExec 30550  R 16419   32          0.01 da-8-bti-Data.db
      45:21:07:05 CompactionExec 30550  R 16419   32          0.01 da-8-bti-Data.db
      48:21:07:05 CompactionExec 30550  R 16441   48          0.01 da-8-bti-Data.db
      49:21:07:05 CompactionExec 30550  R 16441   48          0.00 da-8-bti-Data.db
      52:21:07:05 CompactionExec 30550  R 16421   64          0.01 da-8-bti-Data.db
      53:21:07:05 CompactionExec 30550  R 16421   64          0.01 da-8-bti-Data.db
      56:21:07:05 CompactionExec 30550  R 16415   80          0.01 da-8-bti-Data.db
      57:21:07:05 CompactionExec 30550  R 16415   80          0.01 da-8-bti-Data.db
      60:21:07:05 CompactionExec 30550  R 16439   96          0.01 da-8-bti-Data.db
      61:21:07:05 CompactionExec 30550  R 16439   96          0.00 da-8-bti-Data.db
      64:21:07:05 CompactionExec 30550  R 16390   112         0.01 da-8-bti-Data.db
      65:21:07:05 CompactionExec 30550  R 16390   112         0.00 da-8-bti-Data.db
      68:21:07:05 CompactionExec 30550  R 16384   128         0.01 da-8-bti-Data.db
      69:21:07:05 CompactionExec 30550  R 16384   128         0.01 da-8-bti-Data.db
      72:21:07:05 CompactionExec 30550  R 6019    144         0.00 da-8-bti-Data.db
      73:21:07:05 CompactionExec 30550  R 6019    144         0.00 da-8-bti-Data.db
      

      There are 20 IO operations reported here, each of which occurs twice. I was originally thinking this might be a bug with the reporting, but other files operations are listed only once. For example:

      21:07:05 CompactionExec 30550  W 16386   112         0.09 da-9-bti-Data.db
      21:07:05 CompactionExec 30550  W 4       128         0.00 da-9-bti-Data.db
      

      Looking at the output, the last read is approximately 6KB and starts at the 144KB offset which matches up to the total length of the file.

      The profiler output indicates there's 2 code paths that arrive at the FS operation 9 times.

      I've also attached the profiler output, which shows 22 calls to org.apache.cassandra.io.util.ChannelProxy.read. Two of these originate in the Stats file, leaving 9 chunks in this file, read twice, plus 2 additional calls.

      Attachments

        1. channelproxy-read-da9-rev.html
          11 kB
          Jon Haddad
        2. channelproxy-read-da8.txt
          14 kB
          Jon Haddad

        Issue Links

          Activity

            People

              Unassigned Unassigned
              rustyrazorblade Jon Haddad
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: