Cassandra
  1. Cassandra
  2. CASSANDRA-5982

OutOfMemoryError when writing text blobs to a very large number of tables

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.10, 2.0.1
    • Component/s: Core
    • Labels:
      None

      Description

      This test goes outside the norm for Cassandra, creating ~2000 column families, and writing large text blobs to them.

      The process goes like this:

      Bring up a 6 node m2.2xlarge cluster on EC2. This instance type has enough memory (34.2GB) so that Cassandra will allocate a full 8GB heap without tuning cassandra-env.sh. However, this instance type only has a single drive, so data and commitlog are comingled. (This test has also been run m1.xlarge instances which have four drives (but lower memory) and has exhibited similar results when assigning one to commitlog and 3 to datafile_directories.)

      Use the 'memtable_allocator: HeapAllocator' setting from CASSANDRA-5935.

      Create 2000 CFs:

      CREATE KEYSPACE cf_stress WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
      CREATE COLUMNFAMILY cf_stress.tbl_00000 (id timeuuid PRIMARY KEY, val1 text, val2 text, val3 text ) ;
      # repeat for tbl_00001, tbl_00002 ... tbl_02000
      

      This process of creating tables takes a long time, about 5 hours, but for anyone wanting to create that many tables, presumably they only need to do this once, so this may be acceptable.

      Write data:

      The test dataset consists of writing 100K, 1M, and 10M documents to these tables:

      INSERT INTO {table_name} (id, val1, val2, val3) VALUES (?, ?, ?, ?)
      

      With 5 threads doing these inserts across the cluster, indefinitely, randomly choosing a table number 1-2000, the cluster eventually topples over with 'OutOfMemoryError: Java heap space'.

      A heap dump analysis indicates that it's mostly memtables:

      Best current theory is that this is commitlog bound and that the memtables cannot flush fast enough due to locking issues. But I'll let Jonathan Ellis comment more on that.

      1. system.log.gz
        893 kB
        Ryan McGuire
      2. 2000CF_memtable_mem_usage.png
        173 kB
        Ryan McGuire

        Activity

        Hide
        Ryan McGuire added a comment -

        This seems to be the solution:

        With these additional settings, I no longer see any OOM errors on any EC2 instance I've tested (m1.xlarge, m2.2xlarge, hs1.8xlarge)

        Show
        Ryan McGuire added a comment - This seems to be the solution: Use Jonathan Ellis ' cfs10k patch set concurrent_write:8 and memtable_total_space_in_mb:1024 in the yaml. Requires 8GB heap. With these additional settings, I no longer see any OOM errors on any EC2 instance I've tested (m1.xlarge, m2.2xlarge, hs1.8xlarge)
        Hide
        Jonathan Ellis added a comment -

        This workload presents several challenges to Cassandra:

        • MemoryMeter is slow, and we [correctly] limit it to one CFS at a time. If MemoryMeter is busy measuring a large memtable, we can easily write a lot of data to other CFs (or even the same one in a new memtable). So we need to pick a more realistic liveRatio default to start with until we can get the first measurements. (This is not a problem for Ryan's workload here; extremely large blobs have a liveRatio close to 1 anyway. But it could be a problem for a workload with smaller cells.)
        • The MemoryMeter queue is unbounded, and can keep memtables on-heap long after they've been flushed. We should keep a reference to CFS instead and measure the currently active memtable when the task runs.
        • forceFlush acquires Table.switchLock.writeLock temporarily, which means that it will block as long as readLock is held. In particular, this means it will block for CommitLog.add. This can actually be worse for PCLE than BCLE, since the former will allow up to 1K unwritten entries on its queue. If the queue is full of large blobs like the ones in this workload, clearing up space in that queue can take a while. (The queue itself can also be a significant source of memory consumption!) We should move forceFlush to a separate executor so it can be truely nonblocking (which will allow MeteredFlusher to impose a moratorium on writes to the other CFs that should be flushed that much faster), and also reduce the CL queue size. There is no reasonable way to adjust the queue size at runtime, so a configuration setting should be introduced.

        IMO, all but the new pre-flush executor changes are reasonable to make in 1.2.10; the last I'd rather keep 2.0-only.

        Show
        Jonathan Ellis added a comment - This workload presents several challenges to Cassandra: MemoryMeter is slow, and we [correctly] limit it to one CFS at a time. If MemoryMeter is busy measuring a large memtable, we can easily write a lot of data to other CFs (or even the same one in a new memtable). So we need to pick a more realistic liveRatio default to start with until we can get the first measurements. (This is not a problem for Ryan's workload here; extremely large blobs have a liveRatio close to 1 anyway. But it could be a problem for a workload with smaller cells.) The MemoryMeter queue is unbounded, and can keep memtables on-heap long after they've been flushed. We should keep a reference to CFS instead and measure the currently active memtable when the task runs. forceFlush acquires Table.switchLock.writeLock temporarily, which means that it will block as long as readLock is held. In particular, this means it will block for CommitLog.add. This can actually be worse for PCLE than BCLE, since the former will allow up to 1K unwritten entries on its queue. If the queue is full of large blobs like the ones in this workload, clearing up space in that queue can take a while. (The queue itself can also be a significant source of memory consumption!) We should move forceFlush to a separate executor so it can be truely nonblocking (which will allow MeteredFlusher to impose a moratorium on writes to the other CFs that should be flushed that much faster), and also reduce the CL queue size. There is no reasonable way to adjust the queue size at runtime, so a configuration setting should be introduced. IMO, all but the new pre-flush executor changes are reasonable to make in 1.2.10; the last I'd rather keep 2.0-only.
        Hide
        Jonathan Ellis added a comment -

        NB: even with these changes, testing shows that thousands of CFs fragments writes to the point that compaction and flush are effectively performing random i/o. Rather than introduce a "global flush log" that imposes a moratorium on all CFs when the memtable memory budget is exeeded for this corner case, it is recommended to reduce concurrent_writes instead until flushing is able to keep up with the ingest.

        Show
        Jonathan Ellis added a comment - NB: even with these changes, testing shows that thousands of CFs fragments writes to the point that compaction and flush are effectively performing random i/o. Rather than introduce a "global flush log" that imposes a moratorium on all CFs when the memtable memory budget is exeeded for this corner case, it is recommended to reduce concurrent_writes instead until flushing is able to keep up with the ingest.
        Show
        Jonathan Ellis added a comment - Pushed branches to https://github.com/jbellis/cassandra/tree/5982-1.2 and https://github.com/jbellis/cassandra/tree/5982-2.0
        Show
        Jeremiah Jordan added a comment - We should add https://github.com/jbellis/cassandra/commit/2e22cf23ec4e18cd99b69bb3d419931c55e3ba93 to tpstats also.
        Hide
        Yuki Morishita added a comment -

        5982-1.2 is looking good to me (nit: unused import sun.security.provider.Sun in Memtable).

        Approach in 5982-2.0 works, though changes to Future<Future> breaks some of the places like CommitLogReplayer that waits on only the first Future. So I think we need to fix those too.
        Also, ColumnFamilyStore#reload needs to acquire write lock, so we have to switch switchMemtable to switchMemtableInternal there.

        Show
        Yuki Morishita added a comment - 5982-1.2 is looking good to me (nit: unused import sun.security.provider.Sun in Memtable). Approach in 5982-2.0 works, though changes to Future<Future> breaks some of the places like CommitLogReplayer that waits on only the first Future. So I think we need to fix those too. Also, ColumnFamilyStore#reload needs to acquire write lock, so we have to switch switchMemtable to switchMemtableInternal there.
        Hide
        Yuki Morishita added a comment -

        In relation to CASSANDRA-5605, it may be better to change preExecutor -> flushwriter -> postExecuter chain with something different that correctly handles errors thrown.

        Show
        Yuki Morishita added a comment - In relation to CASSANDRA-5605 , it may be better to change preExecutor -> flushwriter -> postExecuter chain with something different that correctly handles errors thrown.
        Hide
        Jonathan Ellis added a comment -

        We should add https://github.com/jbellis/cassandra/commit/2e22cf23ec4e18cd99b69bb3d419931c55e3ba93 to tpstats also.

        I'll turn Memtable.meterExecutor into a JMXEnabledTPE, which is what shows up in tpstats, but there isn't really a good way to shoehorn the commitlog executor in there.

        Show
        Jonathan Ellis added a comment - We should add https://github.com/jbellis/cassandra/commit/2e22cf23ec4e18cd99b69bb3d419931c55e3ba93 to tpstats also. I'll turn Memtable.meterExecutor into a JMXEnabledTPE, which is what shows up in tpstats, but there isn't really a good way to shoehorn the commitlog executor in there.
        Hide
        Jonathan Ellis added a comment -

        5982-1.2 is looking good to me (nit: unused import sun.security.provider.Sun in Memtable).

        Committed w/ nit fixed.

        Show
        Jonathan Ellis added a comment - 5982-1.2 is looking good to me (nit: unused import sun.security.provider.Sun in Memtable). Committed w/ nit fixed.
        Hide
        Jonathan Ellis added a comment -

        it may be better to change preExecutor -> flushwriter -> postExecuter chain with something different that correctly handles errors thrown.

        I'll just drop this for now.

        Show
        Jonathan Ellis added a comment - it may be better to change preExecutor -> flushwriter -> postExecuter chain with something different that correctly handles errors thrown. I'll just drop this for now.

          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            Ryan McGuire
            Reviewer:
            Yuki Morishita
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development