Details
- Type: Improvement
- Status: Open
- Priority: Normal
- Resolution: Unresolved
Description
In C* 3.0 we started to use incremental repair by default. However, this seems to create a repair performance problem under a relatively write-heavy workload that keeps all available concurrent_compactors occupied by active compactions.
I was able to demonstrate this issue by the following scenario:
1. On a three-node C* 3.0.7 cluster, use "cassandra-stress write n=100000000" to generate 100GB of data in the keyspace1.standard1 table using LCS (ctrl+c the stress client once the data size on each node reaches 35+GB).
2. At this point, each node has hundreds of L0 SSTables waiting for LCS to digest, and with concurrent_compactors left at its default of 2, both compaction threads are constantly busy processing the backlogged L0 SSTables.
3. Now create a new keyspace called "trivial_ks" with RF=3 and create a small two-column CQL table in it, and insert 6 records.
4. Start a "nodetool repair trivial_ks" session on one of the nodes, and watch the following behavior:
automaton@wdengdse50google-98425b985-3:~$ nodetool repair trivial_ks
[2016-07-13 01:57:28,364] Starting repair command #1, repairing keyspace trivial_ks with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
[2016-07-13 01:57:31,027] Repair session 27212dd0-489d-11e6-a6d6-cd06faa0aaa2 for range [(3074457345618258602,-9223372036854775808], (-9223372036854775808,-3074457345618258603], (-3074457345618258603,3074457345618258602]] finished (progress: 66%)
[2016-07-13 02:07:47,637] Repair completed successfully
[2016-07-13 02:07:47,657] Repair command #1 finished in 10 minutes 19 seconds
In short, repairing such a small table took more than 10 minutes. In debug.log, the entries for this particular repair session UUID show that every node got through validation compaction within 15ms. One node, however, then got stuck waiting for a compaction slot: it has to run an anti-compaction step before it can tell the initiating node that its part of the repair session is done, and it took 10+ minutes for a compaction slot to free up, as the following debug.log entries show:
DEBUG [AntiEntropyStage:1] 2016-07-13 01:57:30,956 RepairMessageVerbHandler.java:149 - Got anticompaction request AnticompactionRequest{parentRepairSession=27103de0-489d-11e6-a6d6-cd06faa0aaa2} org.apache.cassandra.repair.messages.AnticompactionRequest@34449ff4
<...> <snip> <...>
DEBUG [CompactionExecutor:5] 2016-07-13 02:07:47,506 CompactionTask.java:217 - Compacted (286609e0-489d-11e6-9e03-1fd69c5ec46c) 32 sstables to [/var/lib/cassandra/data/keyspace1/standard1-9c02e9c1487c11e6b9161dbd340a212f/mb-499-big,] to level=0. 2,892,058,050 bytes to 2,874,333,820 (~99% of original) in 616,880ms = 4.443617MB/s. 0 total partitions merged to 12,233,340. Partition merge counts were {1:12086760, 2:146580, }
INFO [CompactionExecutor:5] 2016-07-13 02:07:47,512 CompactionManager.java:511 - Starting anticompaction for trivial_ks.weitest on 1/[BigTableReader(path='/var/lib/cassandra/data/trivial_ks/weitest-538b07d1489b11e6a9ef61c6ff848952/mb-1-big-Data.db')] sstables
INFO [CompactionExecutor:5] 2016-07-13 02:07:47,513 CompactionManager.java:540 - SSTable BigTableReader(path='/var/lib/cassandra/data/trivial_ks/weitest-538b07d1489b11e6a9ef61c6ff848952/mb-1-big-Data.db') fully contained in range (-9223372036854775808,-9223372036854775808], mutating repairedAt instead of anticompacting
INFO [CompactionExecutor:5] 2016-07-13 02:07:47,570 CompactionManager.java:578 - Completed anticompaction successfully
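The wait can be read directly off the timestamps in the entries above. As a rough check, the sketch below subtracts the "Got anticompaction request" timestamp from the "Starting anticompaction" timestamp. The class and method names are made up for illustration, and the format pattern is an assumption based on how the timestamps appear in this debug.log excerpt.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class AnticompactionWait {
    // Assumed timestamp layout, taken from the debug.log lines quoted in this ticket.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Elapsed time between two log timestamps given as strings.
    static Duration between(String start, String end) {
        return Duration.between(LocalDateTime.parse(start, LOG_TS),
                                LocalDateTime.parse(end, LOG_TS));
    }

    public static void main(String[] args) {
        // "Got anticompaction request" -> "Starting anticompaction"
        Duration wait = between("2016-07-13 01:57:30,956", "2016-07-13 02:07:47,512");
        System.out.println(wait.toMillis() + " ms");
    }
}
```

This prints 616556 ms, i.e. about 10 minutes 16 seconds, which is almost exactly the 616,880ms that the 32-SSTable keyspace1.standard1 compaction on CompactionExecutor:5 reports. The anticompaction request appears to have simply waited for that one compaction to finish.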
Validation compaction has its own threads, outside the regular compaction thread pool restricted by concurrent_compactors, so it went through without any issue. If anti-compaction were treated the same way (i.e. given its own thread pool), this kind of repair performance problem could be avoided.
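To illustrate the idea outside of Cassandra, here is a minimal sketch using plain java.util.concurrent executors. It is not Cassandra's CompactionManager code: the class name, method name and sleep-based "compactions" are all hypothetical stand-ins, and the pool size of 2 only mirrors the default concurrent_compactors value mentioned above. It measures how long a trivial task waits when it shares a saturated fixed pool, compared with running on a pool of its own.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CompactionSlotSketch {
    // Returns how long (ms) a trivial "anti-compaction" task sat in the queue
    // while every slot of the regular pool was busy for busyMillis.
    static long antiCompactionDelayMillis(boolean dedicatedPool, long busyMillis) throws Exception {
        int slots = 2; // stand-in for concurrent_compactors = 2
        ExecutorService compaction = Executors.newFixedThreadPool(slots);
        // Hypothetical change: a separate executor for anti-compaction.
        ExecutorService anti = dedicatedPool ? Executors.newSingleThreadExecutor() : compaction;
        try {
            CountDownLatch allBusy = new CountDownLatch(slots);
            for (int i = 0; i < slots; i++) {
                compaction.submit(() -> {
                    allBusy.countDown();
                    try {
                        Thread.sleep(busyMillis); // simulated long L0 compaction
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allBusy.await(); // every regular slot is now occupied
            long submitted = System.nanoTime();
            // The task itself is trivial; it only reports how long it waited to start.
            Future<Long> waited = anti.submit(() -> System.nanoTime() - submitted);
            return TimeUnit.NANOSECONDS.toMillis(waited.get());
        } finally {
            compaction.shutdown();
            if (anti != compaction) {
                anti.shutdown();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared pool wait (ms):    " + antiCompactionDelayMillis(false, 500));
        System.out.println("dedicated pool wait (ms): " + antiCompactionDelayMillis(true, 500));
    }
}
```

With the shared pool, the trivial task waits roughly as long as the simulated compactions run. With the dedicated pool it starts almost immediately. A real change would also have to consider how anti-compaction interacts with compaction throttling and with SSTables that are currently being compacted, which this sketch ignores.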
Issue Links
- is related to: CASSANDRA-11218 Prioritize Secondary Index rebuild (Awaiting Feedback)