CASSANDRA-6918

Compaction Assert: Incorrect Row Data Size


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Duplicate

    Description

      I have four tables in a keyspace with replication factor 6 (previously we set this to 3, but when we added more nodes we figured that more replication would improve read times; this change might have aggravated the issue).
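      For reference, the replication factor change would have been applied with something along these lines (the keyspace name loadtest_1 is taken from the data path in the stack trace below; SimpleStrategy here is only a placeholder for whatever strategy the cluster actually uses):

      -- Sketch only: the strategy class is a placeholder.
      ALTER KEYSPACE loadtest_1
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 6};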

      create table table_value_one (
      id timeuuid PRIMARY KEY,
      value_1 counter
      );

      create table table_value_two (
      id timeuuid PRIMARY KEY,
      value_2 counter
      );

      create table table_position_lookup (
      value_1 bigint,
      value_2 bigint,
      id timeuuid,
      PRIMARY KEY (id)
      ) WITH compaction = {'class': 'LeveledCompactionStrategy'};

      create table sorted_table (
      row_key_index text,
      range bigint,
      sorted_value bigint,
      id timeuuid,
      extra_data list<bigint>,
      PRIMARY KEY ((row_key_index, range), sorted_value, id)
      ) WITH CLUSTERING ORDER BY (sorted_value DESC)
      AND compaction = {'class': 'LeveledCompactionStrategy'};

      The application creates an object and stores it in sorted_table based on its value position. For example, an object might have a value_1 of 5500 and a value_2 of 4300.

      There are rows that represent indices by which I can sort items on these values in descending order. If I wish to see the items with the highest value_1, I can create an index that stores them like so:

      row_key_index = 'highest_value_1s'

      Additionally, we shard each row into bucket ranges, which are simply value_1 or value_2 rounded down to the nearest 1000. For example, our object above would be found in row_key_index = 'highest_value_1s' with range 5000, and also in row_key_index = 'highest_value_2s' with range 4000.
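      As a concrete illustration, the example object ends up in two index rows roughly like this (the id and extra_data values are made up):

      -- Illustrative only: the id and extra_data values are placeholders.
      INSERT INTO sorted_table (row_key_index, range, sorted_value, id, extra_data)
      VALUES ('highest_value_1s', 5000, 5500, 0aa9e3f0-b345-11e3-a5e2-0800200c9a66, [1, 2, 3]);
      INSERT INTO sorted_table (row_key_index, range, sorted_value, id, extra_data)
      VALUES ('highest_value_2s', 4000, 4300, 0aa9e3f0-b345-11e3-a5e2-0800200c9a66, [1, 2, 3]);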

      The true values of this object are stored in two counter tables, table_value_one and table_value_two. The current indexed position is stored in table_position_lookup.
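      The counter updates and the position bookkeeping look roughly like the following sketch (placeholder values, not our exact statements):

      -- Sketch: bump a counter, then record the currently indexed position.
      UPDATE table_value_one
      SET value_1 = value_1 + 100
      WHERE id = 0aa9e3f0-b345-11e3-a5e2-0800200c9a66;

      INSERT INTO table_position_lookup (id, value_1, value_2)
      VALUES (0aa9e3f0-b345-11e3-a5e2-0800200c9a66, 5500, 4300);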

      We allow the application to modify value_1 and value_2 in the counter tables at any time. If we know the current values are dirty, we wait a tuned amount of time before we update the position in the sorted_table index. Each such update creates two delete operations and two write operations on the same table.
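      In other words, each coalesced reposition boils down to roughly these four statements against sorted_table (again a sketch with placeholder values; the old positions are read from table_position_lookup):

      -- Sketch: remove the old index entries, then write the new ones.
      DELETE FROM sorted_table
      WHERE row_key_index = 'highest_value_1s' AND range = 5000
      AND sorted_value = 5500 AND id = 0aa9e3f0-b345-11e3-a5e2-0800200c9a66;
      DELETE FROM sorted_table
      WHERE row_key_index = 'highest_value_2s' AND range = 4000
      AND sorted_value = 4300 AND id = 0aa9e3f0-b345-11e3-a5e2-0800200c9a66;

      INSERT INTO sorted_table (row_key_index, range, sorted_value, id, extra_data)
      VALUES ('highest_value_1s', 5000, 5600, 0aa9e3f0-b345-11e3-a5e2-0800200c9a66, [1, 2, 3]);
      INSERT INTO sorted_table (row_key_index, range, sorted_value, id, extra_data)
      VALUES ('highest_value_2s', 4000, 4450, 0aa9e3f0-b345-11e3-a5e2-0800200c9a66, [1, 2, 3]);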

      The issue is that when we increase the number of write/delete operations on sorted_table, we see the following assertion error in the system log:

      ERROR [CompactionExecutor:169] 2014-03-24 08:07:12,871 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:169,1,main]
      java.lang.AssertionError: incorrect row data size 77705872 written to /var/lib/cassandra/data/loadtest_1/sorted_table/loadtest_1-sorted_table-tmp-ic-165-Data.db; correct is 77800512
      at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
      at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
      at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
      at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
      at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
      at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
      at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:724)

      Each object creates roughly 500 unique row keys in sorted_table, and its extra_data field contains roughly 15 different bigint values.

      Previously, our application ran on Cassandra 1.2.10 and we did not see the assert; at that point sorted_table did not have the extra_data list<bigint> column, and we were writing only around 200 unique row keys containing just the id column.

      We tried both leveled and size-tiered compaction, and both trigger the same assert (the strategy switch is sketched after the stats below). Compaction fails to complete, and after about 100k object writes (creating 55 million rows, each having potentially as many as 100k items in a single column) we have ~2.4 GB of SSTables spread across 4840 files, in 691 SSTables:

      SSTable count: 691
      SSTables in each level: [685/4, 6, 0, 0, 0, 0, 0, 0, 0]
      Space used (live): 2244774352
      Space used (total): 2251159892
      SSTable Compression Ratio: 0.15101393198465862
      Number of Keys (estimate): 4704128
      Memtable Columns Count: 0
      Memtable Data Size: 0
      Memtable Switch Count: 264
      Read Count: 9204
      Read Latency: NaN ms.
      Write Count: 10151343
      Write Latency: NaN ms.
      Pending Tasks: 0
      Bloom Filter False Positives: 0
      Bloom Filter False Ratio: 0.00000
      Bloom Filter Space Used: 3500496
      Compacted row minimum size: 125
      Compacted row maximum size: 62479625
      Compacted row mean size: 1285302
      Average live cells per slice (last five minutes): 1001.0
      Average tombstones per slice (last five minutes): 8566.5
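
      Switching the table between the two strategies amounts to statements along these lines (a sketch; any non-default options we used are omitted):

      -- Sketch of the compaction strategy switch.
      ALTER TABLE sorted_table WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
      ALTER TABLE sorted_table WITH compaction = {'class': 'LeveledCompactionStrategy'};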

      Some mitigation strategies we have discussed include:

      • Breaking sorted_table into multiple column families to spread the writes between them.
      • Increasing the coalescing time delay.
      • Removing extra_data and paying the cost of another table lookup for each item.
      • Compressing extra_data into a blob (see the sketch after this list).
      • Reducing the replication factor back down to 3 to reduce size pressure on the SSTables.
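
      For the blob idea, the variant we have in mind is roughly the following (the table name is a placeholder; the application would serialize and compress the bigint list itself):

      create table sorted_table_packed (
      row_key_index text,
      range bigint,
      sorted_value bigint,
      id timeuuid,
      extra_data blob,
      PRIMARY KEY ((row_key_index, range), sorted_value, id)
      ) WITH CLUSTERING ORDER BY (sorted_value DESC)
      AND compaction = {'class': 'LeveledCompactionStrategy'};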

      Running nodetool repair -pr does not fix the issue, and running nodetool compact manually has not solved it either. The asserts happen fairly frequently across all nodes of the cluster.

            People

              Assignee: Unassigned
              Reporter: Alexander Goodrich (agoodrich)
