Cassandra / CASSANDRA-7139

Default concurrent_compactors is probably too high

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 2.1 rc1
    • Component/s: None
    • Labels:
      None

      Description

      The default number of concurrent compactors is probably too high for modern hardware with spinning disks for storage: a modern blade can easily have 24+ cores, which results in a default of 24 concurrent compactions. This not only increases random IO, it also keeps obsoleted files around for an unnecessarily long time, because each compaction holds references to any possibly-overlapping files that it isn't itself compacting, even though those files may already have been obsoleted by compactions that finished earlier. If you factor in the default compaction throughput rate of 16MB/s, anything but a single default concurrent_compactor makes very little sense: a single thread can comfortably sustain 16MB/s, causes less interference with other processes, and lets obsoleted files be removed immediately.

      See http://imgur.com/HDqhxFp for a graph demonstrating the result of making this change on a box with 24 cores and 8TB of storage (the first spike is with default settings).
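      As a rough illustration of the arithmetic above (the 24-core figure and the 16MB/s default are taken from the description; the class and variable names are only for this sketch):

      // Illustrative sketch only; figures come from the description above.
      public class CompactionDefaultSketch
      {
          public static void main(String[] args)
          {
              int cores = 24;                  // modern blade, per the description
              int totalThroughputMB = 16;      // default compaction throughput, shared by all compactors

              // Old default: one compactor per core.
              int defaultCompactors = cores;

              // The throughput cap is shared, so each compactor gets only a sliver of
              // bandwidth while still holding references to possibly-overlapping
              // (and possibly already obsoleted) files.
              double mbPerCompactor = (double) totalThroughputMB / defaultCompactors;
              System.out.printf("%d compactors share %d MB/s -> ~%.2f MB/s each%n",
                                defaultCompactors, totalThroughputMB, mbPerCompactor);
          }
      }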

      Attachments

      1. 7139.txt
        2 kB
        Jonathan Ellis


          Activity

          jbellis Jonathan Ellis added a comment -

          That's a graph of ... something vs time?

          benedict Benedict added a comment -

          disk (space) utilisation vs time

          jbellis Jonathan Ellis added a comment -

          so first spike is defaults, what are the other seven?

          benedict Benedict added a comment -

          They're flush/compaction spikes during operation with only one concurrent_compactor; i.e., their disk space was exploding prior to the change, and they were having to bounce nodes daily to reclaim disk space. The graph only goes back to just before the config option was changed.

          jbellis Jonathan Ellis added a comment -

          One could be reasonable with SSD + unlimited compaction throughput, especially with LCS. But on HDD + STCS [still the default] getting compactions "piled up" behind a huge compaction op is a real thing.

          How about one per disk, instead of one per core?

          benedict Benedict added a comment -

          How about: 1 per disk, with a cap of 8, say? Boxes with 12+ (even 24+) disks aren't totally uncommon and you could see the same problem there as well.

          This should all be less of a problem with CASSANDRA-6696 as we'll be able to actually schedule on a per-disk basis and have no risk of referring to files on other disks, so we just want a sensible number to avoid breaking anyone who hasn't tuned their nodes between now and then.
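          For illustration only, a minimal sketch of the "1 per disk, capped at 8" proposal being discussed here (this is not the attached patch; the class, method, and parameter names are made up for this sketch):

          // Sketch: one compactor per data directory ("disk"), capped at 8,
          // and never more than the number of cores on the box.
          public final class ConcurrentCompactorsDefault
          {
              private ConcurrentCompactorsDefault() {}

              public static int defaultConcurrentCompactors(int dataDirectories, int cores)
              {
                  return Math.min(8, Math.min(dataDirectories, cores));
              }

              public static void main(String[] args)
              {
                  System.out.println(defaultConcurrentCompactors(12, 24)); // 8  (many disks, many cores)
                  System.out.println(defaultConcurrentCompactors(2, 24));  // 2  (few disks)
                  System.out.println(defaultConcurrentCompactors(12, 2));  // 2  (few cores)
              }
          }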

          jbellis Jonathan Ellis added a comment -

          SGTM.

          jbellis Jonathan Ellis added a comment -

          Attached.

          benedict Benedict added a comment -

          LGTM, +1

          jbellis Jonathan Ellis added a comment -

          committed

          jjordan Jeremiah Jordan added a comment -

          Can we get this change in 2.0? We've had the default concurrent_compactors cause issues on a few clusters.

          jbellis Jonathan Ellis added a comment -

          I don't like changing defaults out from under people mid-release. Makes for an unpleasant surprise if those defaults were working for you.

          Antauri Catalin Alexandru Zamfir added a comment -

          Our set-up is RAID5, so min(numberOfDisks, numberOfCores) would just be 2, even though we have 40+ cores. The commented-out "concurrent_compactors" would be "2", meaning that a lot of SSTables accumulate in high-cardinality tables (where the partition key is a UUID type) because compaction is limited to "2". Looking at dstat, even though we've set compaction_throughput_in_mb_per_sec to 192 (spinning disks), the dstat -lrv1 disk write maxes out at 10MB/s.

          IMHO, concurrent_compactors should be number_of_cores / compaction_throughput_in_mb_per_sec * 100, which in our case (40 cores) gives around 20/21 compactors. On 8 cores, 8/192 * 100 gives 4 concurrent compactors.
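          For reference, the formula proposed above works out as follows (a sketch only; the class and method names are illustrative):

          // Sketch of the formula proposed in the comment above: cores / throughput * 100.
          public class ProposedCompactorsFormula
          {
              static int proposedCompactors(int cores, int throughputMBPerSec)
              {
                  return (int) Math.round(cores / (double) throughputMBPerSec * 100);
              }

              public static void main(String[] args)
              {
                  System.out.println(proposedCompactors(40, 192)); // ~21, the "20/21" quoted above
                  System.out.println(proposedCompactors(8, 192));  // 4
              }
          }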

          benedict Benedict added a comment -

          This is only the default. You are recommended to tune this default based on your own system's behaviour. With modern SSDs and many cores, many concurrent compactors is a great idea. For spinning disk setups, it can be terrible, and we want to avoid terrible default decisions.

          Either way, I suspect the problem you are encountering is entirely different, i.e. that the default compaction_throughput_mb_per_sec is 10, which would be why you are maxing out at exactly 10MB/s.


            People

            • Assignee:
              jbellis Jonathan Ellis
            • Reporter:
              benedict Benedict
            • Reviewer:
              Benedict
            • Votes:
              0
            • Watchers:
              2
