This is a document bringing up some issues when DTCS is used to compact time series data in a three node cluster. The DTCS is currently configured with a few parameters that are making the configuration fairly simple, but might cause problems in certain special cases like recovering from the flood of small SSTables due to repair operation. We are suggesting some ideas that might be a starting point for further discussions. Following sections are containing:
- Description of the cassandra setup
- Feeding process of the data
- Failure testing
- Issues caused by the repair operations for the DTCS
- Proposal for the DTCS configuration parameters
Attachments are included to support the discussion and there is a separate section giving explanation for those.
Cassandra setup and data model
- Cluster is composed from three nodes running Cassandra 2.1.2. Replication factor is two and read and write consistency levels are ONE.
- Data is time series data. Data is saved so that one row contains a certain time span of data for a given metric ( 20 days in this case). The row key contains information about the start time of the time span and metrix name. Column name gives the offset from the beginning of time span. Column time stamp is set to correspond time stamp when adding together the timestamp from the row key and the offset (the actual time stamp of data point). Data model is analog to KairosDB implementation.
- Average sampling rate is 10 seconds varying significantly from metric to metric.
- 100 000 metrics are fed to the Cassandra.
- max_sstable_age_days is set to 5 days (objective is to keep SStable files in manageable size, around 50 GB)
- TTL is not in use in the test.
Procedure for the failure test.
- Data is first dumped to Cassandra for 11 days and the data dumping is stopped so that DTCS will have a change to finish all compactions. Data is dumped with "fake timestamps" so that column time stamp is set when data is written to Cassandra.
- One of the nodes is taken down and new data is dumped on top of the earlier data covering couple of hours worth of data (faked time stamps).
- Dumping is stopped and the node is kept down for few hours.
- Node is taken up and the "nodetool repair" is applied on the node that was down.
- Repair operation will lead to massive amount of new SStables far back in the history. New SStables are covering similar time spans than the files that were created by DTCS before the shutdown of one of the nodes.
- To be able to compact the small files the max_sstable_age_days should be increased to allow compaction to handle the files. However, the in a practical case the time window will increase so large that generated files will be huge that is not desirable. The compaction also combines together one very large file with a bunch of small files in several phases that is not effective. Generating really large files may also lead to out of disc space problems.
- See the list of time graphs later in the document.
Improvement proposals for the DTCS configuration
Below is a list of desired properties for the configuration. Current parameters are mentioned if available.
- Initial window size (currently:base_time_seconds)
- The amount of similar size windows for the bucketing (currently: min_threshold)
- The multiplier for the window size when increased (currently: min_threshold). This we would like to be independent from the min_threshold parameter so that you could actually control the rate how fast the window size is increased.
- Maximum length of the time window inside which the files are assigned for a certain bucket (not currently defined). This means that expansion of time window length is restricted. When the limit is reached the window size will be same all the way back in the history (e.g. one week)
- The maximum horizon in which SStables are candidates for buckets (currently: max_sstable_age_days)
- Maximum file size of SStable allowed to be in a set of files to be compacted (not possible currently). Preventing out of disk space situations.
- Optional strategies to select the most interesting bucket:
- Minimum amount of SStables in the time window before it is a candidate for the most interesting bucket (currently: min_threshold for the most recent window, otherwise two). Being able set this value independently would allow to put most of the efforts on those areas where a large amount of small files should be compacted together instead of few new files.
- Optionally, the criteria for the most interesting bucket could be set: e.g. select the window with most files to be compacted.
- Inside the bucket when the amount of files is limited by max_threshold, the compaction would select first small files instead of one huge file and a bunch of small files.
The above set of parameters allows to recover from repair operations producing large amount of small SStables.
- Maximum length of the time window for compactions would keep the compacted SStable size in reasonable range and would allow to extend the horizon far back in the history
- Combining small files together instead of combining one huge file with e.g. 31 small files again and again is more disk efficient
In addition to the previous advantages the above parameters would also allow:
- Dumping of more data in the history (e.g. new metrics) by assigning the correct timestamp for the column (fake time stamp) and proper compaction of new and existing SStables.
- Expiring reasonable size SStable with TTL even if the compactions would be intermittently executed far back in the history. In this case the new data has to fed with TTL calculated dynamically.
- Note: Being able to give the absolute time stamp for the column expiry time would be beneficial when data is dumped back in the history. This is the case when you move data from some legacy system to Cassandra with faked time stamps and would like to keep the data only a certain time period. Currently the absolute time stamp is calculated by Cassandra from the system time and given TTL. TTL has to be calculated dynamically based on the current time and desired expiry moment making things more complex.
One interesting question is that why those duplicate SStable files are created? The duplication problem could not be produced when the data was dumped with following spec:
- 100 metrics
- 20 days of data in one row
- one year of data
- max_sstable_age_days = 15
- memtable_offheap_space_in_mb was decreased so that small SStables were created (to create something to be compacted)
- One node was taken down and one more day of data was dumped on top of the earlier data
- "nodetool repair -pr" was executed on each node => duplicates were checked in each step => no duplicates
- "nodetool repair" was executed on a node that was down => no duplicates were generated
Time graphs of content of SSTables from different phases of the test run:
Fields in the below time graphs are following:
- Order number from the SSTable file name
- Minimum column timestamp in the SSTable file
- Timespan representation graphically
- Maximum column time stamp in SStable
- The size of the SStable in megabytes
Time graphs after dumping the 11 days of data and letting all compactions to run through
node2_20150621_1646_time_graph.txt (error: same as for node1, but the behavior is same)
Time graphs after taking one node down (node2) and dumping couple of hours of mode data
Time graphs when the repair operation has finished and compactions are done. Compactions will naturally handle only the files inside the max_sstable_age_days range.
==> Now there is a large amount of small files covering pretty much same areas as the original SStables
Trend from the SStable count as a function of time on each node.
1) Clearing the database and dumping the 11 days worth of data
2) Stopping the dumping and letting compactions run
3) Taking one node down (top bottom one in figure) and dumping few hours of new data on top of earlier data
4) Starting the repair operation
5) Repair operation finished
Nodetool status prints before and after repair operation
nodetool status infos.txt
Log files were parsed to demonstrate the creation of new small SStables and the combination of one large file with a bunch of small ones. This is done from the time range where the max_sstable_age_days is able to reach (in this case 5 days). The hierarchy of the files is shown in the file "sstable_compaction_trace.txt". The first flushed file can be found from the line 10.
Each line represents either a flushed SStable or the SStable created by DTCS. For flushed files the timestamp indicates the time period the file represents. For compacted files (marked with C) the first timestamp represents the moment when the compaction was done (wall clock). Time stamps are faked when written to the database. The size of the file is the last field. The first field with number in parenthesis shows the level of the file. Top level files marked with (0) are those that don't have any predecessors and should be found from the disk also.
SStables that are created by the repair operation are not mentioned in the log files so they are handled as phantom files. The existence of file can be concluded from the predecessor list of compacted SStable. Those are marked with None,None in timestamps.
In the file "sstable_compaction_trace_snipped.txt" is one portion that shows the compaction hierarchy for the small files originating from the repair operation. max_threshold is in the default value of 32. In each step 31 tiny files are compacted together with 46 GB file.