Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.11
    • Component/s: API, Core
    • Labels: None
    • Environment:

Description

      We are doing large-volume insert/update tests against Cassandra via CQL3.

      Using a 4GB heap, after roughly 750,000 updates creating/updating 75,000 row keys, we run out of heap; the usage never goes back down, and we begin getting this infamous error that many people seem to be encountering:

      WARN [ScheduledTasks:1] 2013-09-26 16:17:10,752 GCInspector.java (line 142) Heap is 0.9383457210434385 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
      INFO [ScheduledTasks:1] 2013-09-26 16:17:10,753 StorageService.java (line 3614) Unable to reduce heap usage since there are no dirty column families

      8GB and 12GB heaps appear to delay the problem by roughly proportionate amounts, about 75,000-100,000 row keys per 4GB. Each run of 50,000 row-key creations sees the heap grow and never shrink again.

      We have attempted the following, to no effect:

      • removing all secondary indexes, to see if that alleviates overuse of bloom filters
      • adjusting compaction throughput parameters
      • adjusting memtable flush thresholds and other parameters

      Examining heap dumps makes it apparent that the problem is perpetual retention of CQL3 BATCH statements. We have even tried dropping the keyspaces after the updates, and the CQL3 statements are still visible in the heap dump, even after many, many CMS GC runs. G1 showed the same issue.

      The 750,000 statements are broken into batches of roughly 200 statements.
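
      For illustration only (this is not the reporter's actual code): a minimal sketch of the load pattern described above, in which every batch is built with inlined values and then prepared. It assumes a java.sql.Connection obtained from the Cassandra JDBC driver; the class, method, and argument names are hypothetical.

          import java.sql.Connection;
          import java.sql.PreparedStatement;

          public class BatchLoadSketch
          {
              // Builds one CQL3 BATCH out of ~200 already-rendered UPDATE/INSERT strings and
              // prepares it. Because the values are inlined, every batch is a distinct statement,
              // so each prepare registers a new entry in the server-side prepared-statement cache.
              static void writeBatch(Connection conn, Iterable<String> updates) throws Exception
              {
                  StringBuilder batch = new StringBuilder("BEGIN BATCH\n");
                  for (String update : updates)
                      batch.append(update).append(";\n");
                  batch.append("APPLY BATCH;");

                  try (PreparedStatement ps = conn.prepareStatement(batch.toString()))
                  {
                      ps.execute();
                  }
              }
          }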

      1. 6107.patch
        2 kB
        Lyuben Todorov
      2. 6107_v2.patch
        4 kB
        Lyuben Todorov
      3. Screen Shot 2013-10-03 at 17.59.37.png
        77 kB
        Lyuben Todorov
      4. 6107_v3.patch
        7 kB
        Lyuben Todorov
      5. 6107-v4.txt
        5 kB
        Jonathan Ellis

Issue Links

Activity

          Jonathan Ellis added a comment -

          Note: this was reverted in 1.2.14 because of CASSANDRA-6592.

          Michael Oczkowski added a comment -

          This change appears to break code that uses EmbeddedCassandraService and PreparedStatements in version 1.2.11. Please see CASSANDRA-6293 for details.

          Sylvain Lebresne added a comment -

          CASSANDRA-5981 is indeed about limiting the size at the protocol level. However, it's a global frame limitation. In particular, it is the hard limit for queries together with their values, and for that reason the current hard-coded limit is relatively high (256MB). We can bikeshed on the exact default to use, and CASSANDRA-5981 will probably allow the user to play with that limit, but in any case it will definitely have to be higher than 1MB. The other detail is that the limit in CASSANDRA-5981 is on the sent bytes, not the in-memory size of the query, but that probably doesn't matter much.

          Anyway, given that a prepared statement doesn't include values, it wouldn't be absurd to have a specific, lower limit on their size. Though my own preference would be to just leave it to a global limit on the preparedStatements cache map (though it could make sense to reject statements that blow through the entire limit on their own, so as to make sure to respect it). Too many hard-coded limits make me nervous.

          Jonathan Ellis added a comment -

          commented

          Jonathan Ellis added a comment -

          "I'm not sure about the rejection of large statements at protocol level."

          I think that's what CASSANDRA-5981 is open for, actually.

          Lyuben Todorov added a comment -

          LGTM. But I was able to build some pretty big batch statements (>4MB), so I'm not sure about the rejection of large statements at protocol level.

          Jonathan Ellis added a comment -

          On second thought, rejecting really huge statements should be done at the protocol level. I'll follow up with Sylvain to see if we're already doing that.

          v4 attached that just does the weighing as discussed. WDYT?

          Lyuben Todorov added a comment -

          Set the total cache for statements to 1/256 of the heap (for both Thrift and CQL), and set the individual statement limit to 1MB. If a statement is under 1MB but too large for the cache, it is executed but not stored in the cache.
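
          For illustration, a rough sketch (not the attached patch) of the admission logic that description implies, assuming jamm's MemoryMeter for sizing and a cache capacity of 1/256 of the heap; all names below are made up:

              import org.github.jamm.MemoryMeter;

              public final class CacheAdmissionSketch
              {
                  private static final long MAX_STATEMENT_SIZE = 1024 * 1024;                         // 1MB per statement
                  private static final long CACHE_CAPACITY = Runtime.getRuntime().maxMemory() / 256;  // 1/256 of heap
                  private static final MemoryMeter meter = new MemoryMeter();                         // needs -javaagent:jamm.jar

                  // Returns true if the statement should be stored in the cache after execution.
                  static boolean admit(Object preparedStatement)
                  {
                      long size = meter.measureDeep(preparedStatement);
                      if (size > MAX_STATEMENT_SIZE)
                          throw new IllegalArgumentException("prepared statement is larger than the 1MB limit");
                      // Under 1MB but still bigger than the whole cache: execute it, just don't keep it.
                      return size <= CACHE_CAPACITY;
                  }
              }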

          Lyuben Todorov added a comment -

          Memory Usage graph for batches of 'insert' statements with varying numbers of columns.

          Jonathan Ellis added a comment -

          Use the cache weigher/weightedCapacity API instead of re-measuring the entire cache each time; then the cache will take care of evicting old entries to make room as needed.

          I suggest making the capacity 1/256 of the heap size.

          There should probably be a separate setting for the maximum single statement size. If a single statement is under that threshold but larger than the cache, execute it but do not cache it.

          Finally, the statement-id size should be negligible; I'd leave that out.
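
          For reference, a minimal sketch of what such a weigher-based cache might look like, reusing the ConcurrentLinkedHashMap Builder quoted later in this thread and jamm's MemoryMeter for deep sizing. This is illustrative, not the committed patch, and the 1/256-of-heap figure simply mirrors the suggestion above:

              import com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap;
              import com.googlecode.concurrentlinkedhashmap.Weigher;
              import org.apache.cassandra.cql3.CQLStatement;
              import org.apache.cassandra.utils.MD5Digest;
              import org.github.jamm.MemoryMeter;

              public class WeightedCacheSketch
              {
                  private static final MemoryMeter meter = new MemoryMeter();  // requires -javaagent:jamm.jar
                  private static final long MAX_CACHE_PREPARED_MEMORY = Runtime.getRuntime().maxMemory() / 256;

                  // Weigh each entry by its deep in-memory size so eviction tracks bytes, not entry count.
                  private static final ConcurrentLinkedHashMap<MD5Digest, CQLStatement> preparedStatements =
                      new ConcurrentLinkedHashMap.Builder<MD5Digest, CQLStatement>()
                          .maximumWeightedCapacity(MAX_CACHE_PREPARED_MEMORY)
                          .weigher(new Weigher<CQLStatement>()
                          {
                              public int weightOf(CQLStatement statement)
                              {
                                  return (int) Math.min(meter.measureDeep(statement), Integer.MAX_VALUE);
                              }
                          })
                          .build();
              }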

          Lyuben Todorov added a comment -

          Changed the MemoryMeter to measure the full map rather than enforcing restrictions on individual prepared statements. The hard-coded maximum is 100MB for the Thrift cache and 100MB for the CQL cache.

          Jonathan Ellis added a comment -

          Right. Use the size you're calculating as the weight in the cache Map.

          Aleksey Yeschenko added a comment -

          I don't think the issue here is (just) large individual prepared statements. It's the total size that all the prepared statements are occupying. That's what should be tracked and limited, not just the individual ones.

          Lyuben Todorov added a comment -

          Added a MemoryMeter to QueryProcessor#getStatement to track the size of each ParsedStatement that is Prepared. If a statement is larger than 1MB (1,048,576 bytes), an IllegalArgumentException is thrown. Also added the size of the prepared statement to the tracing line inside QueryProcessor#getStatement.
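
          For illustration (this is not the attached patch, just a sketch of the check it describes, with hypothetical names), jamm's MemoryMeter can deep-measure a statement like so; it requires running with -javaagent:jamm.jar, as Cassandra already does:

              import org.github.jamm.MemoryMeter;

              public class StatementSizeCheckSketch
              {
                  private static final long MAX_STATEMENT_SIZE = 1048576; // 1MB

                  // Measures the deep in-memory size of a prepared statement and rejects it if oversized.
                  static long checkSize(Object preparedStatement)
                  {
                      long bytes = new MemoryMeter().measureDeep(preparedStatement);
                      if (bytes > MAX_STATEMENT_SIZE)
                          throw new IllegalArgumentException(
                              String.format("prepared statement is %d bytes, above the 1MB limit", bytes));
                      return bytes; // the caller can append this to the tracing line
                  }
              }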

          Constance Eustace added a comment -

          Yep, removing statement preparation looks good! Heap is GC'ing, and multiple runs can be done.

          Jonathan Ellis added a comment -

          That sounds reasonable.

          I'd actually just pin it at something pretty small, like 1MB; with CASSANDRA-4693 we shouldn't need to prepare monstrous batches.

          Sylvain Lebresne added a comment -

          Given that preparing is not performance sensitive, I suppose we could even use jamm to get the precise in-memory size, and then cap the prepared statements to some percentage of the heap.

          Jonathan Ellis added a comment -

          Could we fix the OOM by adding a weight to the Map entry instead of assuming all entries are equal?

          Constance Eustace added a comment -

          We're using SpringJDBC on top of the cass-jdbc driver. Intercepting the update to specify a consistency level is only convenient with a PreparedStatementCreator...

          So we will not use SpringJDBC/PreparedStatementCreator and will instead do a more manual JDBC call...

          Sorry for the "critical" spam...

          Constance Eustace added a comment -

          I agree, there is client dysfunction here... we're going to stop prepping the statements if possible (I think the cass-jdbc project may have required prepping in order to set the consistency level, which sucks, but let me verify).

          Constance Eustace added a comment -

          It appears you are using a MAX_CACHE_PREPARED of 100,000, and ConcurrentLinkedHashMap should use that as its eviction limit.

          If the individual cached entries for 200-statement batches are large (say, 10k each; based on the heap dump they appear to consist of one map per statement in the batch, so that is easily possible), then 100,000 entries x 100,000 bytes per entry = 10 gigabytes... uh oh.

          I think 600,000 updates, i.e. 3,000 batches of 200 statements each, popped the heap on a 4GB node. I figure 1GB of that heap is used for filters/sstables/memtables/etc., so 3,000 batches consumed 3GB of heap, roughly a megabyte per batch.

          Can we expose MAX_CACHE_PREPARED as a config parameter?

          Sylvain Lebresne added a comment -

          "but shouldn't there be a limit on stored prep statements with LRU eviction?"

          There is (from QueryProcessor.java):

              public static final int MAX_CACHE_PREPARED = 100000; // Enough to keep buggy clients from OOM'ing us
              private static final Map<MD5Digest, CQLStatement> preparedStatements = new ConcurrentLinkedHashMap.Builder<MD5Digest, CQLStatement>()
                                                                                         .maximumWeightedCapacity(MAX_CACHE_PREPARED)
                                                                                         .build();
          

          But it's possible that this limit was too high to prevent you from OOM'ing in that case. And maybe that hard-coded value is too high.

          But really, you should not prepare every single statement; that is a client error.

          Constance Eustace added a comment -

          It appears that since we are sending prepared statements (this allows us to prep the statement and then set the consistency level), they are never evicted from the prepared-statement cache in org.apache.cassandra.cql3.QueryProcessor.

          There are no removes ever done on preparedStatements or thriftPreparedStatements...

          This may technically be our fault for preparing every single batch statement, but shouldn't there be a limit on stored prep statements with LRU eviction?

          Constance Eustace added a comment -
          • Further examination of several point-in-time heap dumps shows that ALL CQL statement batches are retained in the heap. Each statement holds multiple collections such as ConcurrentHashMaps and other data structures, which obviously consume huge amounts of resources.
          • We have done a smaller run that does NOT batch our updates. It is obviously much slower, but the heap dumps show objects being garbage collected properly over time.

People

    • Assignee: Lyuben Todorov
    • Reporter: Constance Eustace
    • Reviewer: Jonathan Ellis
    • Votes: 0
    • Watchers: 5

Dates

    • Created:
    • Updated:
    • Resolved:

Development