[CASSANDRA-14279] Row Tombstones in separate sstables / separate compaction path - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Normal
Resolution: Unresolved
Fix Version/s: None
Component/s: Consistency/Repair, Legacy/Local Write-Read Paths, Local/Compaction
Labels: None

Description

In my experience, if data is not well organized into time-windowed sstables, Cassandra has enormous difficulty actually deleting data that has a "medium term" lifetime and is commingled with data that isn't marked for death, as happens with compactions or intermingled write patterns. For example, you might have an active working set and be archiving "unused" data to other tables or clusters, purging data, or migrating/sharding/restructuring data. Whatever the case, you want that disk space back, and you might not be able to truncate.

In STCS and LCS, row tombstones are intermingled with column data and column tombstones. But a row tombstone represents a significant event in the data lifecycle: it marks large amounts of data as "droppable" during compaction, and it gives reads a shortcut that avoids reading that row's data from other sstables. It could also enable writes to be discarded in rare data patterns where the row tombstone is ahead in time.

I am wondering whether isolating row tombstones in their own sstables, separately compacted and merged, might enable compaction to work more efficiently:

Reads can prioritize bloom filter lookups that indicate a row tombstone, getting the timestamp of the deletion first, and then use that timestamp against the data sstables to filter data, or short-circuit the read entirely if the row data carries an overall "most recent data timestamp".

Compaction could be forced to reference all the row tombstone sstables, so that every time two or more "data" sstables are compacted, the row tombstones are consulted to purge data.
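To make the read-path point above concrete, here is a minimal sketch in Python. It is not Cassandra code: the dictionary-based "sstables", the per-row "newest timestamp" field, and the names `latest_row_deletion` and `read_row` are all hypothetical stand-ins, and bloom filters are reduced to a plain key lookup. It also assumes that a deletion wins when its timestamp ties with a write.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-ins; none of these are Cassandra classes.
Cells = Dict[str, Tuple[int, str]]          # column -> (write timestamp, value)
DataSSTable = Dict[str, Tuple[int, Cells]]  # row key -> (newest timestamp in row, cells)
TombstoneSSTable = Dict[str, int]           # row key -> row deletion timestamp


def latest_row_deletion(tombstones: List[TombstoneSSTable], key: str) -> Optional[int]:
    """Return the newest row-deletion timestamp for key, or None if never deleted."""
    stamps = [t[key] for t in tombstones if key in t]
    return max(stamps, default=None)


def read_row(key: str,
             tombstones: List[TombstoneSSTable],
             data: List[DataSSTable]) -> Dict[str, str]:
    """Consult the row tombstones first, then read only what survives them."""
    deleted_at = latest_row_deletion(tombstones, key)
    merged: Cells = {}
    for sstable in data:
        if key not in sstable:
            continue
        row_max_ts, cells = sstable[key]
        # Short-circuit: everything in this row predates the deletion.
        # Assumption: a deletion wins a timestamp tie.
        if deleted_at is not None and row_max_ts <= deleted_at:
            continue
        for column, (ts, value) in cells.items():
            if deleted_at is not None and ts <= deleted_at:
                continue  # cell is shadowed by the row tombstone
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)  # newest write wins
    return {column: value for column, (_, value) in merged.items()}
```

In this sketch the first data sstable for a deleted row can be skipped without looking at any of its cells, which is the short-circuit the "most recent data timestamp" would buy.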

In LCS, this would be particularly useful in getting data out of the upper levels without having to wait for data to trickle up the tree. The row tombstones, being read-only inputs into the data sstable compactions, can be referenced in each of the LCS levels' parallel compactors.
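The compaction idea can be sketched roughly as follows. This is an illustration only, under stated assumptions: sstables are modelled as plain dictionaries, `compact_with_row_tombstones` is a hypothetical name, the tombstone index is passed in as a read-only mapping, and gc_grace handling is ignored. The tombstones themselves stay in their own sstables, so dropping shadowed data here never removes the record of the deletion.

```python
from typing import Dict, List, Mapping, Tuple

# Hypothetical stand-ins; not Cassandra classes.
Cells = Dict[str, Tuple[int, str]]  # column -> (write timestamp, value)
DataSSTable = Dict[str, Cells]      # row key -> cells


def compact_with_row_tombstones(inputs: List[DataSSTable],
                                row_tombstones: Mapping[str, int]) -> DataSSTable:
    """Merge data sstables, dropping every cell shadowed by a row tombstone.

    row_tombstones is only read, never modified, so several compactions
    (for example one per LCS level) could share the same mapping.
    """
    output: DataSSTable = {}
    for sstable in inputs:
        for key, cells in sstable.items():
            deleted_at = row_tombstones.get(key)
            for column, (ts, value) in cells.items():
                # Assumption: a deletion wins a timestamp tie.
                if deleted_at is not None and ts <= deleted_at:
                    continue  # purge: the row was deleted after this write
                row = output.setdefault(key, {})
                if column not in row or ts > row[column][0]:
                    row[column] = (ts, value)  # newest write wins
    return output
```

Because the tombstone mapping is an input rather than something discovered among the sstables being compacted, the purge does not depend on the tombstone and the data happening to land in the same compaction.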

Based on discussions on the dev list, this would appear to require some sort of customization of the memtable->sstable flushing process, and perhaps a different set of bloom filters.
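One way to picture the flush customization is a flush that writes two outputs instead of one. The sketch below is purely illustrative: the memtable layout (a `row_deleted_at` field next to `cells`) and the function name `flush_memtable` are assumptions made for the example, not how Cassandra's memtable is structured.

```python
from typing import Dict, Tuple

Cells = Dict[str, Tuple[int, str]]  # column -> (write timestamp, value)


def flush_memtable(memtable: Dict[str, dict]) -> Tuple[Dict[str, Cells], Dict[str, int]]:
    """Split a memtable into a data sstable and a row tombstone sstable.

    Each memtable entry is assumed to look like
    {"row_deleted_at": int or absent, "cells": {column: (ts, value)}}.
    """
    data: Dict[str, Cells] = {}
    row_tombstones: Dict[str, int] = {}
    for key, entry in memtable.items():
        deleted_at = entry.get("row_deleted_at")
        if deleted_at is not None:
            row_tombstones[key] = deleted_at  # goes to the tombstone sstable
        # Only cells written after the deletion need to reach the data sstable.
        live = {
            column: (ts, value)
            for column, (ts, value) in entry.get("cells", {}).items()
            if deleted_at is None or ts > deleted_at
        }
        if live:
            data[key] = live
    return data, row_tombstones
```

Each output would then get its own bloom filter, which is where the "different set of bloom filters" would come in.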

Since the row tombstone sstables contain only <rowkey>,<tombstone timestamp> pairs, they should be comparatively small and take less time to compact. They could be aggressively compacted on a different schedule than "data" sstables.
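Compacting such sstables reduces to keeping the newest deletion timestamp per row key. A minimal sketch, assuming each tombstone sstable is just a mapping from row key to deletion timestamp (the name `compact_row_tombstones` is hypothetical):

```python
from typing import Dict, List


def compact_row_tombstones(inputs: List[Dict[str, int]]) -> Dict[str, int]:
    """Merge row tombstone sstables, keeping the newest deletion per row key."""
    merged: Dict[str, int] = {}
    for sstable in inputs:
        for key, deleted_at in sstable.items():
            if key not in merged or deleted_at > merged[key]:
                merged[key] = deleted_at
    # Emit keys in sorted order, as an sstable would store them.
    return dict(sorted(merged.items()))
```

There are no cell values to carry along, so the work per row is a single comparison.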

In addition, it may be easier to repair/synchronize row tombstones across the cluster if they have already been separated into their own sstables.

Column/range tombstones may also benefit from a similar separation, but my guess is that those are so much more numerous, large, and fine-grained that they might as well coexist with the data.

People

Assignee: Unassigned

Reporter: Constance Eustace

Votes: 1

Watchers: 6

Dates

Created: 27/Feb/18 19:40

Updated: 16/Apr/19 09:29