Details
Type: Bug
Status: Resolved
Priority: Normal
Resolution: Duplicate
Environment:
11-node Linux Cassandra 1.2.15 cluster, each node configured as follows:
2P Intel Xeon CPU X5660 @ 2.8 GHz (12 cores, 24 threads total)
148 GB RAM
CentOS release 6.4 (Final)
2.6.32-358.11.1.el6.x86_64 #1 SMP Wed May 15 10:48:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)

Node configuration:
Default cassandra.yaml settings for the most part, with the following exception:
rpc_server_type: hsha
Description
I have four tables in a schema with Replication Factor 6 (previously we set this to 3, but when we added more nodes we figured that increasing replication would improve read times; this may have aggravated the issue).
create table table_value_one (
id timeuuid PRIMARY KEY,
value_1 counter
);
create table table_value_two (
id timeuuid PRIMARY KEY,
value_2 counter
);
create table table_position_lookup (
value_1 bigint,
value_2 bigint,
id timeuuid,
PRIMARY KEY (id)
) WITH compaction=
;
create table sorted_table (
row_key_index text,
range bigint,
sorted_value bigint,
id timeuuid,
extra_data list<bigint>,
PRIMARY KEY ((row_key_index, range), sorted_value, id)
) WITH CLUSTERING ORDER BY (sorted_value DESC) AND
compaction=
;
The application creates an object, and stores it in sorted_table based on a value position - for example, an object has a value_1 of 5500, and a value_2 of 4300.
There are rows which represent indices by which I can sort items based on these values in descending order. If I wish to see items with the highest # of value_1, I can create an index that stores them like so:
row_key_index = 'highest_value_1s'
Additionally, we shard each row into bucket ranges, where the range is simply value_1 or value_2 rounded down to the nearest 1000. For example, our object above would be found in row_key_index = 'highest_value_1s' with range 5000, and also in row_key_index = 'highest_value_2s' with range 4000.
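The bucketing rule above can be sketched as follows (an illustrative Python sketch, assuming the range is the value rounded down to the nearest 1000, as the 5500 → 5000 example implies; the actual application code is not part of this report):

```python
def bucket_range(value, bucket_size=1000):
    """Shard key component: round the value down to the nearest bucket_size."""
    return (value // bucket_size) * bucket_size

# The example object: value_1 = 5500, value_2 = 4300
assert bucket_range(5500) == 5000   # lands in ('highest_value_1s', 5000)
assert bucket_range(4300) == 4000   # lands in ('highest_value_2s', 4000)
```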
The true values of this object are stored in two counter tables, table_value_one and table_value_two. The current indexed position is stored in table_position_lookup.
We allow the application to modify value_1 and value_2 in the counter tables indiscriminately. If we know the current values are dirty, we wait a tuned amount of time before updating the position in the sorted_table index. This creates two delete operations and two write operations on the same table.
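The position update described above amounts to four statements against sorted_table: a delete of the old position and an insert of the new one, for each of the two indices. A minimal sketch of building those statements (a hypothetical helper for illustration only; the real application code is not shown in this report):

```python
def position_update_statements(obj_id, old, new, bucket=1000):
    """Build the 4 CQL statements (2 deletes + 2 inserts) that move an
    object's indexed positions in sorted_table.

    `old` and `new` are (value_1, value_2) tuples; `obj_id` is the
    timeuuid literal. Statements are returned as strings."""
    stmts = []
    for index, old_v, new_v in [("highest_value_1s", old[0], new[0]),
                                ("highest_value_2s", old[1], new[1])]:
        old_range = (old_v // bucket) * bucket
        new_range = (new_v // bucket) * bucket
        stmts.append(
            f"DELETE FROM sorted_table WHERE row_key_index = '{index}' "
            f"AND range = {old_range} AND sorted_value = {old_v} AND id = {obj_id};")
        stmts.append(
            f"INSERT INTO sorted_table (row_key_index, range, sorted_value, id) "
            f"VALUES ('{index}', {new_range}, {new_v}, {obj_id});")
    return stmts
```

Each coalesced update therefore turns into two tombstones and two new cells per index pass, which is where the write/delete pressure on sorted_table comes from.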
The issue is that when we increase the number of write/delete operations on sorted_table, we see the following assertion error in the system log:
ERROR [CompactionExecutor:169] 2014-03-24 08:07:12,871 CassandraDaemon.java (line 191) Exception in thread Thread[CompactionExecutor:169,1,main]
java.lang.AssertionError: incorrect row data size 77705872 written to /var/lib/cassandra/data/loadtest_1/sorted_table/loadtest_1-sorted_table-tmp-ic-165-Data.db; correct is 77800512
at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:162)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Each object creates approximately 500 unique row keys in sorted_table, and each possesses an extra_data field containing approximately 15 different bigint values.
Previously, our application ran Cassandra 1.2.10 and we did not see the assert; at that time sorted_table did not have the extra_data list<bigint> column, and we were writing only around 200 unique row keys containing just the id column.
We tried both leveled compaction and size-tiered compaction, and both trigger the same assert: compaction fails to complete, and after about 100k object writes (creating 55 million rows, each potentially having as many as 100k items in a single column) we have ~2.4 GB of SSTables spread across 4840 files, with 691 SSTables:
SSTable count: 691
SSTables in each level: [685/4, 6, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 2244774352
Space used (total): 2251159892
SSTable Compression Ratio: 0.15101393198465862
Number of Keys (estimate): 4704128
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 264
Read Count: 9204
Read Latency: NaN ms.
Write Count: 10151343
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 3500496
Compacted row minimum size: 125
Compacted row maximum size: 62479625
Compacted row mean size: 1285302
Average live cells per slice (last five minutes): 1001.0
Average tombstones per slice (last five minutes): 8566.5
Some mitigation strategies we have discussed include:
- Breaking sorted_table into multiple column families to spread the writes across them.
- Increasing the coalescing time delay.
- Removing extra_data and paying the cost of another table lookup for each item.
- Compressing extra_data into a blob.
- Reducing the replication factor back down to 3 to reduce size pressure on the SSTables.
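The "compress extra_data into a blob" mitigation could look like the following (an illustrative sketch using Python's struct module, replacing the list<bigint> column with a single blob column; this is not code from the application):

```python
import struct

def pack_extra_data(values):
    """Pack a list of signed 64-bit ints into one big-endian blob,
    so extra_data can be stored as a single blob cell instead of a
    list<bigint> with one cell per element."""
    return struct.pack(f">{len(values)}q", *values)

def unpack_extra_data(blob):
    """Inverse of pack_extra_data: recover the list of bigints."""
    n = len(blob) // 8
    return list(struct.unpack(f">{n}q", blob))

data = [1, 2, 3, 2**40]
blob = pack_extra_data(data)
assert unpack_extra_data(blob) == data
assert len(blob) == 8 * len(data)
```

The trade-off is that the list can no longer be read or appended to element-by-element in CQL; every update rewrites the whole blob.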
Running nodetool repair -pr does not fix the issue, and neither has running nodetool compact manually. The asserts happen frequently across all nodes of the cluster.
Attachments

Issue Links
- duplicates: CASSANDRA-4180 Single-pass compaction for LCR (Resolved)
- is related to: CASSANDRA-7543 Assertion error when compacting large row with map/list field or range tombstone (Resolved)