Cassandra / CASSANDRA-12615

Improve LCS compaction concurrency during L0->L1 compaction


Details

    • Type: Improvement
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Local/Compaction

    Description

      I've run multiple experiments with compaction-stress at the 100GB, 200GB, 400GB and 600GB data sizes. These scenarios share a common pattern: at the beginning of compaction they all overwhelm L0 with a large number (hundreds to thousands) of 128MB SSTables. One common observation from visualizing the compaction.log files of these tests is that, after the initial massive STCS-in-L0 activity (which can take up to 40% of the total compaction time), the first L0->L1 compaction always takes a very long time. It frequently involves all of the bigger L0 SSTables (the results of the many earlier STCS compactions) plus all 10 L1 SSTables, and its output covers almost the full data set. Since L0->L1 can only run single-threaded, we often spend close to 40% of the total compaction time in this L0->L1 stage, and only after this first, very long L0->L1 finishes and hundreds or thousands of SSTables land on L1 can concurrent compactions at higher levels resume (to move those L1 SSTables to higher levels). The attached snapshot demonstrates this observation.

      The question is: if this L0->L1 compaction is so big, can only run single-threaded, and ends up generating thousands of L1 SSTables, most of which will have to be up-leveled later anyway (since L1 can accommodate at most 10 SSTables), why not start that L1+ up-leveling earlier, i.e. before this L0->L1 compaction finishes?

      I can think of a few approaches:

      1) Break L0->L1 into smaller chunks whenever it becomes clear that the output of the L0->L1 compaction will far exceed the capacity of L1. This lets each L0->L1 chunk finish sooner, and the resulting L1 SSTables can then participate in higher up-level compactions (see the sketch after this list).
      2) Still treat the full L0->L1 as one big compaction session, but make the intermediate results available for higher up-level compactions once the number of L1 output SSTables exceeds the L1 capacity.
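      To make approach 1 concrete, here is a minimal, purely illustrative sketch of the size-based splitting. None of the types, names or constants below come from the Cassandra codebase; a real implementation would also have to pull the overlapping L1 SSTables into each chunk and keep the chunk outputs non-overlapping in token range.

      {code:java}
      // Hypothetical sketch of approach 1: split the L0 candidates into chunks whose
      // estimated L1 output stays close to the L1 capacity, so each chunk finishes
      // quickly and its output can start up-leveling while later chunks still run.
      // All types and constants here are illustrative stand-ins, not Cassandra's.
      import java.util.ArrayList;
      import java.util.List;

      final class L0ChunkingSketch
      {
          static final long SSTABLE_SIZE_BYTES = 128L * 1024 * 1024; // 128MB, as in the experiments
          static final int L1_TARGET_SSTABLES = 10;                  // L1 capacity noted in the description

          /** Minimal stand-in for an SSTable; only its on-disk size matters here. */
          static final class SSTable
          {
              final long sizeBytes;
              SSTable(long sizeBytes) { this.sizeBytes = sizeBytes; }
          }

          static List<List<SSTable>> splitIntoChunks(List<SSTable> l0Candidates)
          {
              long chunkBudget = L1_TARGET_SSTABLES * SSTABLE_SIZE_BYTES; // roughly one "full L1" per chunk
              List<List<SSTable>> chunks = new ArrayList<>();
              List<SSTable> current = new ArrayList<>();
              long currentBytes = 0;

              for (SSTable sstable : l0Candidates)
              {
                  if (!current.isEmpty() && currentBytes + sstable.sizeBytes > chunkBudget)
                  {
                      chunks.add(current);          // this chunk compacts on its own and finishes sooner
                      current = new ArrayList<>();
                      currentBytes = 0;
                  }
                  current.add(sstable);
                  currentBytes += sstable.sizeBytes;
              }
              if (!current.isEmpty())
                  chunks.add(current);
              return chunks;
          }
      }
      {code}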

      If we can somehow leverage more threads during this massive L0->L1 phase, we can save close to 40% of the total compaction time whenever L0 is initially backlogged, which would be a great improvement to our LCS compaction throughput.
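      For approach 2, a similarly hypothetical sketch (again, none of these types are Cassandra's): the long-running L0->L1 session would publish each finished L1 output SSTable as soon as it is sealed, once the session has already produced more output than L1 can hold, so that L1->L2 compactions can start consuming those SSTables while the tail of the L0->L1 session is still running.

      {code:java}
      // Hypothetical sketch of approach 2; all types here are illustrative stand-ins.
      import java.util.ArrayList;
      import java.util.List;

      final class IncrementalL1PublishSketch
      {
          static final int L1_TARGET_SSTABLES = 10; // L1 capacity noted in the description

          /** Minimal stand-in for the part of the leveled manifest this idea needs. */
          interface Manifest
          {
              void addToL1(String sealedSSTable); // makes the SSTable visible to L1->L2 candidate selection
          }

          private final Manifest manifest;
          private final List<String> heldBack = new ArrayList<>();
          private int sealedOutputs = 0;

          IncrementalL1PublishSketch(Manifest manifest)
          {
              this.manifest = manifest;
          }

          /** Called by the L0->L1 session each time it seals one output SSTable. */
          void onOutputSSTableSealed(String sealedSSTable)
          {
              sealedOutputs++;
              heldBack.add(sealedSSTable);
              // Once the session has clearly produced more than L1 can hold, stop
              // holding outputs back: publish everything sealed so far so that L1->L2
              // compactions can run concurrently with the rest of this session.
              if (sealedOutputs > L1_TARGET_SSTABLES)
              {
                  for (String s : heldBack)
                      manifest.addToL1(s);
                  heldBack.clear();
              }
          }
      }
      {code}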

      Attachments

        1. L0_L1_inefficiency.jpg (226 kB, Wei Deng)


      People

        Assignee: Unassigned
        Reporter: Wei Deng (weideng)
        Votes: 0
        Watchers: 4
