CASSANDRA-2006: Serverwide caps on memtable thresholds

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.8 beta 1
    • Component/s: Core
    • Labels:
      None

      Description

      By storing global operation and throughput thresholds, we could eliminate the "many small memtables" problem caused by having many CFs. The global threshold would be set in the config file, so that different classes of servers can be configured with different values.

      Operations occurring in the memtable would add to the global counters, in addition to the memtable-local counters. When a global threshold was violated, the memtable in the system that was using the largest fraction of its local threshold would be flushed. Local thresholds would continue to act as they always have.

      The result would be larger sstables, safer operation with multiple CFs, and per-node tuning.
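
      For illustration, a minimal sketch of how such a server-wide cap might be expressed in the config file, using the memtable_total_space_in_mb and flush_largest_memtables_at names that come up in the comments below; the values shown are placeholders, not recommended defaults:

      # Server-wide cap on memtable heap usage. When the estimated total crosses
      # this, the memtable contributing the most is flushed. If the setting is
      # absent, behavior falls back to the old per-CF thresholds.
      memtable_total_space_in_mb: 2048

      # Complementary GC-driven safety valve: flush the largest memtables when
      # post-GC heap usage exceeds this fraction of the heap.
      flush_largest_memtables_at: 0.75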

      Attachments

      1. 2006.txt (22 kB, Jonathan Ellis)
      2. 2006-v2.txt (26 kB, Jonathan Ellis)
      3. 2006-v3.txt (27 kB, Jonathan Ellis)
      4. jamm-0.2.jar (6 kB, Jonathan Ellis)

          Activity

          Hudson added a comment -

          Integrated in Cassandra #838 (See https://hudson.apache.org/hudson/job/Cassandra/838/)

          Jonathan Ellis added a comment -

          Committed.

          I think we want to keep flush_largest_memtables_at; since it measures directly from the GC, it's a good complement.

          Agreed re deprecating the per-CF settings, but let's see how this shakes out with more testing first.

          Stu Hood added a comment -

          +1
          I'm very excited about this change: my last nitpick is that the flush_largest_memtables_at and memtable_total_space_in_mb settings could be made more consistent. At the absolute minimum, they should refer to one another in the config file, but I'm wondering how we might unify the 3 or 4 different reasons for flushing in our monitoring/logging somehow.

          Also, we should start making a plan to deprecate the per-CF settings, or convert them into fractions as mentioned above.

          Jonathan Ellis added a comment -

          v3 adds warning messages if the computed ratio falls outside of 1 < r < 64.

          Jonathan Ellis added a comment -

          The count can take minutes on a large (GB) memtable under load. We probably don't want to block flush that long.

          Stu Hood added a comment -

          Since the only one consuming the getLiveSize value is the MeteredFlusher thread, could the live-size update be moved into that thread instead? Adding a new executor and 2 new tasks seems like overkill, and it looks like it would remove the "potentially a flushed memtable being counted by jamm" fuzziness.

          Jonathan Ellis added a comment -

          v2 moves to trunk, adds a 25% fudge factor, and adds MeteredFlusherTest to demonstrate handling 100 CFs in a workload that OOMs w/o this feature. (Note that I disabled emergency GC-based flushing in the test config, so that doesn't cause a false negative.)

          Jonathan Ellis added a comment -

          Requires JAMM (see https://github.com/jbellis/jamm/ and CASSANDRA-2203).

          Jonathan Ellis added a comment -

          Patch that optionally creates a global heap usage threshold and tries to keep total memtable size under that.

          The two main points of interest are Memtable.updateLiveRatio and MeteredFlusher.

          MeteredFlusher is what checks memory usage (once per second) and kicks off the flushes. Note that naively flushing when we hit the threshold is wrong, since you can have multiple memtables in-flight during the flush process. To address this, we track inactive but unflushed memtables and include those in our total. We also aggressively flush any memtable that reaches the level of "if my entire flush pipeline were full of memtables of this size, how big could I allow them to be."

          Since counting each object's size is far too slow to be useful directly, we compute the ratio of in-memory size to serialized size in the background, and update that periodically; that is what updateLiveRatio does. MeteredFlusher then bases its work on actual serialized size, multiplied by this ratio.

          One last note: the config code is a little messy because we want to leave behavior unchanged (i.e., only use the old per-CF thresholds) if the setting is absent, as it would be for an upgrader. But we also want a setting that means "pick a reasonable default based on heap usage"; hence the distinction between null and -1 (autocompute).

          I tested by creating the stress schema, then modifying the per-CF settings to be multiple TB, so only the new global flusher affects things. Then I created half a GB of commitlog files to replay – CL replay hammers it much harder than even stress.java.

          It was successful in preventing OOM (or even the "emergency flushing" at 85% of heap), but heap usage as reported by CMS was consistently about 25% higher than what MeteredFlusher thought it should be. It may be that we can apply a fudge factor for this; otherwise, tuning by watching CMS vs. the estimated size and adjusting the setting manually to compensate is still much easier than the status quo of per-CF tuning.
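
          For illustration, a minimal sketch of the metering logic described above. The MeteredFlusher and updateLiveRatio names come from the patch; the interface and method names below are hypothetical stand-ins rather than the patch's actual API, and the pipeline-aware "aggressive flush" rule is omitted for brevity:

          import java.util.List;

          // Runs periodically (roughly once per second): estimate total memtable heap
          // usage as serialized size times the background-computed live ratio, and
          // flush the largest contributor once the global threshold is exceeded.
          class MeteredFlusherSketch implements Runnable
          {
              interface MemtableOwner
              {
                  long serializedBytes();   // active memtable plus inactive-but-unflushed ones
                  double liveRatio();       // in-memory size / serialized size, updated off the hot path
                  void forceFlush();
              }

              private final long totalThresholdBytes; // the global cap from the config file
              private final List<MemtableOwner> owners;

              MeteredFlusherSketch(long totalThresholdBytes, List<MemtableOwner> owners)
              {
                  this.totalThresholdBytes = totalThresholdBytes;
                  this.owners = owners;
              }

              private static long estimatedHeapBytes(MemtableOwner o)
              {
                  return (long) (o.serializedBytes() * o.liveRatio());
              }

              public void run()
              {
                  long total = 0;
                  MemtableOwner largest = null;
                  for (MemtableOwner o : owners)
                  {
                      long estimate = estimatedHeapBytes(o);
                      total += estimate;
                      if (largest == null || estimate > estimatedHeapBytes(largest))
                          largest = o;
                  }
                  if (total > totalThresholdBytes && largest != null)
                      largest.forceFlush();
              }
          }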

          To experiment, I recommend also patching the log4j settings as follows:

          Index: conf/log4j-server.properties
          ===================================================================
          --- conf/log4j-server.properties	(revision 1085010)
          +++ conf/log4j-server.properties	(working copy)
          @@ -35,7 +35,8 @@
           log4j.appender.R.File=/var/log/cassandra/system.log
           
           # Application logging options
          -#log4j.logger.org.apache.cassandra=DEBUG
          +log4j.logger.org.apache.cassandra.service.GCInspector=DEBUG
          +log4j.logger.org.apache.cassandra.db.MeteredFlusher=DEBUG
           #log4j.logger.org.apache.cassandra.db=DEBUG
           #log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
          
          Stu Hood added a comment -

          HBase accomplishes this by keeping a threshold of heap usage for Memtables, and flushing the largest when the threshold is crossed, similar to the safety threshold that jbellis added recently.

          David Boxenhorn added a comment -

          I guess that what my suggestion means, in practice, is that "the memtable in the system that was using the largest fraction of its local threshold would be flushed" would be applied when a keyspace threshold is exceeded, rather than when a system threshold is exceeded.

          When a server threshold is exceeded, you would first look for the keyspace that is using the largest fraction of its threshold, then flush the memtable in that keyspace that is using the largest fraction of its local threshold.
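
          As a rough sketch of that two-level selection (the types and method names here are illustrative only, not anything in the patch):

          import java.util.Comparator;
          import java.util.List;

          // Hypothetical shapes: anything with a usage count and a local threshold.
          interface Budgeted { long usedBytes(); long thresholdBytes(); }
          interface KeyspaceLike extends Budgeted { List<? extends Budgeted> memtables(); }

          class TwoLevelFlushSelector
          {
              // When the server-wide threshold is exceeded: pick the keyspace using the
              // largest fraction of its threshold, then the memtable within it that uses
              // the largest fraction of its local threshold.
              static Budgeted pickVictim(List<? extends KeyspaceLike> keyspaces)
              {
                  Comparator<Budgeted> byFractionUsed =
                      Comparator.comparingDouble(b -> b.usedBytes() / (double) b.thresholdBytes());
                  KeyspaceLike fullest = keyspaces.stream().max(byFractionUsed).orElseThrow();
                  return fullest.memtables().stream().max(byFractionUsed).orElseThrow();
              }
          }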

          David Boxenhorn added a comment -

          The way I see it - and it becomes even more necessary for a multi-tenant configuration - there should be three completely separate levels of configuration.

          • Column family configuration is based on data and usage characteristics of column families in your application.
          • Keyspace configuration enables the modularization of your application (and in a multi-tenant environment, you can assign a keyspace to a tenant)
          • Server configuration is based on the specific hardware limitations of the server.

          Server configuration takes priority over keyspace configuration which takes priority over application configuration.

          Looking at it from the inverse perspective:

          • CF configuration tunes access to your CFs
          • Keyspace configuration protects one module of your application from problems in the other modules
          • Server configuration protects the server as a whole from going beyond the limits of its hardware

            People

            • Assignee: Jonathan Ellis
            • Reporter: Stu Hood
            • Reviewer: Stu Hood
            • Votes: 0
            • Watchers: 5
