[CASSANDRA-9644] DTCS configuration proposals for handling consequences of repairs - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: None
Component/s: None
Labels:
- compaction
- dtcs

Description

This is a document bringing up some issues when DTCS is used to compact time series data in a three node cluster. The DTCS is currently configured with a few parameters that are making the configuration fairly simple, but might cause problems in certain special cases like recovering from the flood of small SSTables due to repair operation. We are suggesting some ideas that might be a starting point for further discussions. Following sections are containing:

Description of the cassandra setup
Feeding process of the data
Failure testing
Issues caused by the repair operations for the DTCS
Proposal for the DTCS configuration parameters

Attachments are included to support the discussion and there is a separate section giving explanation for those.

Cassandra setup and data model

Cluster is composed from three nodes running Cassandra 2.1.2. Replication factor is two and read and write consistency levels are ONE.
Data is time series data. Data is saved so that one row contains a certain time span of data for a given metric ( 20 days in this case). The row key contains information about the start time of the time span and metrix name. Column name gives the offset from the beginning of time span. Column time stamp is set to correspond time stamp when adding together the timestamp from the row key and the offset (the actual time stamp of data point). Data model is analog to KairosDB implementation.
Average sampling rate is 10 seconds varying significantly from metric to metric.
100 000 metrics are fed to the Cassandra.
max_sstable_age_days is set to 5 days (objective is to keep SStable files in manageable size, around 50 GB)
TTL is not in use in the test.

Procedure for the failure test.

Data is first dumped to Cassandra for 11 days and the data dumping is stopped so that DTCS will have a change to finish all compactions. Data is dumped with "fake timestamps" so that column time stamp is set when data is written to Cassandra.
One of the nodes is taken down and new data is dumped on top of the earlier data covering couple of hours worth of data (faked time stamps).
Dumping is stopped and the node is kept down for few hours.
Node is taken up and the "nodetool repair" is applied on the node that was down.

Consequences

Repair operation will lead to massive amount of new SStables far back in the history. New SStables are covering similar time spans than the files that were created by DTCS before the shutdown of one of the nodes.
To be able to compact the small files the max_sstable_age_days should be increased to allow compaction to handle the files. However, the in a practical case the time window will increase so large that generated files will be huge that is not desirable. The compaction also combines together one very large file with a bunch of small files in several phases that is not effective. Generating really large files may also lead to out of disc space problems.
See the list of time graphs later in the document.

Improvement proposals for the DTCS configuration

Below is a list of desired properties for the configuration. Current parameters are mentioned if available.

Initial window size (currently:base_time_seconds)
The amount of similar size windows for the bucketing (currently: min_threshold)
The multiplier for the window size when increased (currently: min_threshold). This we would like to be independent from the min_threshold parameter so that you could actually control the rate how fast the window size is increased.
Maximum length of the time window inside which the files are assigned for a certain bucket (not currently defined). This means that expansion of time window length is restricted. When the limit is reached the window size will be same all the way back in the history (e.g. one week)
The maximum horizon in which SStables are candidates for buckets (currently: max_sstable_age_days)
Maximum file size of SStable allowed to be in a set of files to be compacted (not possible currently). Preventing out of disk space situations.
Optional strategies to select the most interesting bucket:
Minimum amount of SStables in the time window before it is a candidate for the most interesting bucket (currently: min_threshold for the most recent window, otherwise two). Being able set this value independently would allow to put most of the efforts on those areas where a large amount of small files should be compacted together instead of few new files.
Optionally, the criteria for the most interesting bucket could be set: e.g. select the window with most files to be compacted.
Inside the bucket when the amount of files is limited by max_threshold, the compaction would select first small files instead of one huge file and a bunch of small files.

The above set of parameters allows to recover from repair operations producing large amount of small SStables.

Maximum length of the time window for compactions would keep the compacted SStable size in reasonable range and would allow to extend the horizon far back in the history
Combining small files together instead of combining one huge file with e.g. 31 small files again and again is more disk efficient

In addition to the previous advantages the above parameters would also allow:

Dumping of more data in the history (e.g. new metrics) by assigning the correct timestamp for the column (fake time stamp) and proper compaction of new and existing SStables.
Expiring reasonable size SStable with TTL even if the compactions would be intermittently executed far back in the history. In this case the new data has to fed with TTL calculated dynamically.
Note: Being able to give the absolute time stamp for the column expiry time would be beneficial when data is dumped back in the history. This is the case when you move data from some legacy system to Cassandra with faked time stamps and would like to keep the data only a certain time period. Currently the absolute time stamp is calculated by Cassandra from the system time and given TTL. TTL has to be calculated dynamically based on the current time and desired expiry moment making things more complex.

One interesting question is that why those duplicate SStable files are created? The duplication problem could not be produced when the data was dumped with following spec:

100 metrics
20 days of data in one row
one year of data
max_sstable_age_days = 15
memtable_offheap_space_in_mb was decreased so that small SStables were created (to create something to be compacted)
One node was taken down and one more day of data was dumped on top of the earlier data
"nodetool repair -pr" was executed on each node => duplicates were checked in each step => no duplicates
"nodetool repair" was executed on a node that was down => no duplicates were generated

--------------------------------------------------------------------------------------------------------------------

ATTACHMENTS

Time graphs of content of SSTables from different phases of the test run:

*************************************************************

Fields in the below time graphs are following:

Order number from the SSTable file name
Minimum column timestamp in the SSTable file
Timespan representation graphically
Maximum column time stamp in SStable
The size of the SStable in megabytes

Time graphs after dumping the 11 days of data and letting all compactions to run through

node0_20150621_1646_time_graph.txt
node1_20150621_1646_time_graph.txt
node2_20150621_1646_time_graph.txt (error: same as for node1, but the behavior is same)

Time graphs after taking one node down (node2) and dumping couple of hours of mode data

node0_20150621_2320_time_graph.txt
node1_20150621_2320_time_graph.txt
node2_20150621_2320_time_graph.txt

Time graphs when the repair operation has finished and compactions are done. Compactions will naturally handle only the files inside the max_sstable_age_days range.

==> Now there is a large amount of small files covering pretty much same areas as the original SStables

node0_20150623_1526_time_graph.txt
node1_20150623_1526_time_graph.txt
node2_20150623_1526_time_graph.txt

-----------
Trend from the SStable count as a function of time on each node.
sstable_counts.jpg

Vertical lines:

1) Clearing the database and dumping the 11 days worth of data
2) Stopping the dumping and letting compactions run
3) Taking one node down (top bottom one in figure) and dumping few hours of new data on top of earlier data
4) Starting the repair operation
5) Repair operation finished

--------
Nodetool status prints before and after repair operation
nodetool status infos.txt

--------------
Tracing compactions

Log files were parsed to demonstrate the creation of new small SStables and the combination of one large file with a bunch of small ones. This is done from the time range where the max_sstable_age_days is able to reach (in this case 5 days). The hierarchy of the files is shown in the file "sstable_compaction_trace.txt". The first flushed file can be found from the line 10.

Each line represents either a flushed SStable or the SStable created by DTCS. For flushed files the timestamp indicates the time period the file represents. For compacted files (marked with C) the first timestamp represents the moment when the compaction was done (wall clock). Time stamps are faked when written to the database. The size of the file is the last field. The first field with number in parenthesis shows the level of the file. Top level files marked with (0) are those that don't have any predecessors and should be found from the disk also.

SStables that are created by the repair operation are not mentioned in the log files so they are handled as phantom files. The existence of file can be concluded from the predecessor list of compacted SStable. Those are marked with None,None in timestamps.

In the file "sstable_compaction_trace_snipped.txt" is one portion that shows the compaction hierarchy for the small files originating from the repair operation. max_threshold is in the default value of 32. In each step 31 tiny files are compacted together with 46 GB file.

sstable_compaction_trace.txt
sstable_compaction_trace_snipped.txt

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

sstable_counts.jpg
24/Jun/15 07:50
87 kB
Antti Nissinen
sstable_compaction_trace_snipped.txt
24/Jun/15 07:50
19 kB
Antti Nissinen
sstable_compaction_trace.txt
24/Jun/15 07:50
524 kB
Antti Nissinen
nodetool status infos.txt
24/Jun/15 07:50
0.8 kB
Antti Nissinen
node2_20150623_1526_time_graph.txt
24/Jun/15 07:50
151 kB
Antti Nissinen
node2_20150621_2320_time_graph.txt
24/Jun/15 07:50
3 kB
Antti Nissinen
node2_20150621_1646_time_graph.txt
24/Jun/15 07:50
3 kB
Antti Nissinen
node1_20150623_1526_time_graph.txt
24/Jun/15 07:50
78 kB
Antti Nissinen
node1_20150621_2320_time_graph.txt
24/Jun/15 07:50
2 kB
Antti Nissinen
node1_20150621_1646_time_graph.txt
24/Jun/15 07:50
3 kB
Antti Nissinen
node0_20150623_1526_time_graph.txt
24/Jun/15 07:50
93 kB
Antti Nissinen
node0_20150621_2320_time_graph.txt
24/Jun/15 07:50
2 kB
Antti Nissinen
node0_20150621_1646_time_graph.txt
24/Jun/15 07:50
2 kB
Antti Nissinen

Issue Links

is duplicated by

CASSANDRA-8371 DateTieredCompactionStrategy is always compacting

Resolved

CASSANDRA-10253 Incremental repairs not working as expected with DTCS

Resolved

Sub-Tasks

1.	Do STCS in DTCS-windows		Resolved	Marcus Eriksson
2.	Make DTCS work well with old data		Resolved	Marcus Eriksson

DTCS configuration proposals for handling consequences of repairs

Details

Description

Attachments

Attachments

Issue Links

Sub-Tasks

Activity

People

Dates