Details
-
Bug
-
Status: Resolved
-
Normal
-
Resolution: Duplicate
-
None
-
None
-
None
-
All
-
None
Description
I was taking a look at the I/O operations performed during compaction and noticed it looks like we're double reading every chunk of the filesystem, meaning we perform 2x the system calls that we need to. The second read will come from page cache, but this isn't free, so we should try to avoid it. Here's the steps to reproduce.
First, wrote a small amount of data to a table and flush. I have a single SSTable that's 151KB here:
-rw-r--r-- 1 cassandra 127 May 1 21:00 da-8-bti-CompressionInfo.db -rw-r--r-- 1 cassandra 151K May 1 21:00 da-8-bti-Data.db -rw-r--r-- 1 cassandra 10 May 1 21:00 da-8-bti-Digest.crc32 -rw-r--r-- 1 cassandra 136 May 1 21:00 da-8-bti-Filter.db -rw-r--r-- 1 cassandra 829 May 1 21:00 da-8-bti-Partitions.db -rw-r--r-- 1 cassandra 0 May 1 21:00 da-8-bti-Rows.db -rw-r--r-- 1 cassandra 5.0K May 1 21:00 da-8-bti-Statistics.db -rw-r--r-- 1 cassandra 94 May 1 21:00 da-8-bti-TOC.txt
Then watch all reads by doing the following:
$ xfsslower -p $PID 0
Next, start a event profiling session on ChannelProxy.read by doing the following with async-profiler:
$ asprof -e org.apache.cassandra.io.util.ChannelProxy.read -d 10 $(cassandra-pid) -f /mnt/cassandra/artifacts/channelproxy-read.txt
Then I started a compaction on the table which lasted roughly a second.
I took the log of filesystem operations and pulled out all the accesses to `da-8-bti-Data.db` using ripgrep:
$ rg da-8-bti-Data.db artifacts/cassandra1/xfsslower-da8.txt
The headings are:
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
The data:
27:21:07:05 CompactionExec 30550 R 16387 0 0.02 da-8-bti-Data.db 28:21:07:05 CompactionExec 30550 R 16387 0 0.00 da-8-bti-Data.db 40:21:07:05 CompactionExec 30550 R 16436 16 0.01 da-8-bti-Data.db 41:21:07:05 CompactionExec 30550 R 16436 16 0.01 da-8-bti-Data.db 44:21:07:05 CompactionExec 30550 R 16419 32 0.01 da-8-bti-Data.db 45:21:07:05 CompactionExec 30550 R 16419 32 0.01 da-8-bti-Data.db 48:21:07:05 CompactionExec 30550 R 16441 48 0.01 da-8-bti-Data.db 49:21:07:05 CompactionExec 30550 R 16441 48 0.00 da-8-bti-Data.db 52:21:07:05 CompactionExec 30550 R 16421 64 0.01 da-8-bti-Data.db 53:21:07:05 CompactionExec 30550 R 16421 64 0.01 da-8-bti-Data.db 56:21:07:05 CompactionExec 30550 R 16415 80 0.01 da-8-bti-Data.db 57:21:07:05 CompactionExec 30550 R 16415 80 0.01 da-8-bti-Data.db 60:21:07:05 CompactionExec 30550 R 16439 96 0.01 da-8-bti-Data.db 61:21:07:05 CompactionExec 30550 R 16439 96 0.00 da-8-bti-Data.db 64:21:07:05 CompactionExec 30550 R 16390 112 0.01 da-8-bti-Data.db 65:21:07:05 CompactionExec 30550 R 16390 112 0.00 da-8-bti-Data.db 68:21:07:05 CompactionExec 30550 R 16384 128 0.01 da-8-bti-Data.db 69:21:07:05 CompactionExec 30550 R 16384 128 0.01 da-8-bti-Data.db 72:21:07:05 CompactionExec 30550 R 6019 144 0.00 da-8-bti-Data.db 73:21:07:05 CompactionExec 30550 R 6019 144 0.00 da-8-bti-Data.db
There are 20 IO operations reported here, each of which occurs twice. I was originally thinking this might be a bug with the reporting, but other files operations are listed only once. For example:
21:07:05 CompactionExec 30550 W 16386 112 0.09 da-9-bti-Data.db 21:07:05 CompactionExec 30550 W 4 128 0.00 da-9-bti-Data.db
Looking at the output, the last read is approximately 6KB and starts at the 144KB offset which matches up to the total length of the file.
The profiler output indicates there's 2 code paths that arrive at the FS operation 9 times.
I've also attached the profiler output, which shows 22 calls to org.apache.cassandra.io.util.ChannelProxy.read. Two of these originate in the Stats file, leaving 9 chunks in this file, read twice, plus 2 additional calls.
Attachments
Attachments
Issue Links
- duplicates
-
CASSANDRA-15452 Improve disk access patterns during compaction (big format)
- Review In Progress