I've updated my branch at https://github.com/pcmanus/cassandra/commits/3708 to add efficient on-disk handling of the new range tombstones.
The idea is that we don't want to have to read every range tombstone for each query, only the ones that correspond to the columns queried. To achieve that, the range tombstones are written along with the columns themselves. So the basic principle of the patch is that if we have a range tombstone RT[x, y] deleting all columns between x and y, we write a tombstone marker on disk before column x. Of course, in practice it's more complicated, because we want to be sure to read that tombstone even if we only read, say, y. To ensure that, the tombstone marker is repeated at the beginning of every column block (index block) the range covers. The code is smart enough not to repeat a marker that is superseded by other ones, so in practice there won't be many repeated markers at the beginning of each block.
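To make the layout concrete, here is a minimal sketch of the marker placement described above. It is not the code from the patch: the class and method names are hypothetical, column names are plain ints, bounds are treated as inclusive, a block is simply a fixed number of columns, and the "superseded marker" optimization is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeTombstoneLayoutSketch {
    // Hypothetical stand-in for RT[start, end]; both bounds are inclusive here.
    static final class RangeTombstone {
        final int start;
        final int end;

        RangeTombstone(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "RT[" + start + "," + end + "]";
        }
    }

    // Splits the sorted column names into blocks of at most blockSize columns and
    // interleaves the tombstone markers. Markers do not count towards blockSize.
    static List<List<String>> layout(int[] sortedColumns, List<RangeTombstone> tombstones, int blockSize) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean[] written = new boolean[tombstones.size()];
        int columnsInBlock = 0;

        for (int column : sortedColumns) {
            if (columnsInBlock == blockSize) {
                // The current block is full, so start a new one.
                blocks.add(current);
                current = new ArrayList<>();
                columnsInBlock = 0;
            }
            boolean atBlockStart = columnsInBlock == 0;
            for (int i = 0; i < tombstones.size(); i++) {
                RangeTombstone rt = tombstones.get(i);
                if (!written[i] && rt.start <= column) {
                    // First occurrence: the marker goes before the first column >= start.
                    current.add(rt.toString());
                    written[i] = true;
                } else if (written[i] && atBlockStart && column <= rt.end) {
                    // The range still covers this block, so repeat the marker at its start.
                    current.add(rt.toString());
                }
            }
            current.add(Integer.toString(column));
            columnsInBlock++;
        }
        // Tombstones starting after the last column are still written at the end.
        for (int i = 0; i < tombstones.size(); i++) {
            if (!written[i]) {
                current.add(tombstones.get(i).toString());
            }
        }
        blocks.add(current);
        return blocks;
    }

    public static void main(String[] args) {
        int[] columns = {1, 2, 3, 4, 5, 6, 7, 8};
        List<RangeTombstone> tombstones = Arrays.asList(new RangeTombstone(2, 7));
        System.out.println(layout(columns, tombstones, 3));
    }
}
```

With columns 1 to 8, blocks of 3 columns and RT[2,7], this prints `[[1, RT[2,7], 2, 3], [RT[2,7], 4, 5, 6], [RT[2,7], 7, 8]]`: a reader that only fetches the last block still sees the marker before reading column 7.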
Note that those tombstone markers are specific to the on-disk format (in memory we use an interval tree), which has 2 consequences for the patch:
- the on-disk format now diverges a little bit from the wire format. So the code (hopefully) cleanly separates the serialization functions that deal with the on-disk format from the others. I don't think it's a bad idea to have that distinction anyway, since we don't want to break the wire protocol but it's ok to change the on-disk one.
- on-disk column iterators (the SSTable iterators) have to handle those tombstone markers, which are not columns per se. I.e., after having read them from disk we want to store them in the interval tree of the ColumnFamily object, not as an IColumn in the ColumnFamily map. To make this distinction, the code introduces an interface called OnDiskAtom, which basically represents either a column or a range tombstone. The sstable iterators return those OnDiskAtom objects, which are then ultimately added correctly to the resulting ColumnFamily object. I do think this is the clean way to handle this, but it is responsible for quite a bit of the code diff.
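As an illustration of the second point, the dispatch could look roughly like the sketch below. The names OnDiskAtom and ColumnFamily come from the patch, but everything else here is a simplified assumption: the nested classes are stand-ins rather than the real ones, column names are ints, timestamps are ignored, and a plain list replaces the interval tree.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

class OnDiskAtomSketch {
    // Either a column or a range tombstone, as read back from disk.
    interface OnDiskAtom {}

    // Simplified stand-in for a column (IColumn in the real code).
    static final class Column implements OnDiskAtom {
        final int name;
        final String value;

        Column(int name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Simplified stand-in for a range tombstone marker; bounds are inclusive here.
    static final class RangeTombstone implements OnDiskAtom {
        final int start;
        final int end;

        RangeTombstone(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Simplified stand-in for the ColumnFamily object.
    static final class ColumnFamily {
        final TreeMap<Integer, Column> columns = new TreeMap<>();
        // Assumption: a list stands in for the interval tree; repeated markers are not deduplicated.
        final List<RangeTombstone> rangeTombstones = new ArrayList<>();

        // Routes each atom to the structure it belongs in.
        void addAtom(OnDiskAtom atom) {
            if (atom instanceof RangeTombstone) {
                rangeTombstones.add((RangeTombstone) atom);
            } else {
                Column column = (Column) atom;
                columns.put(column.name, column);
            }
        }

        // True if any stored range tombstone covers the given column name.
        boolean isDeleted(int name) {
            for (RangeTombstone rt : rangeTombstones) {
                if (rt.start <= name && name <= rt.end) {
                    return true;
                }
            }
            return false;
        }
    }

    // What a reader does with the atoms an sstable iterator hands back.
    static ColumnFamily collect(Iterator<OnDiskAtom> atoms) {
        ColumnFamily cf = new ColumnFamily();
        while (atoms.hasNext()) {
            cf.addAtom(atoms.next());
        }
        return cf;
    }
}
```

The point of the sketch is only that the iterator's element type is the common interface, so the decision between "column map" and "tombstone structure" is made in one place when the ColumnFamily is built.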
I'll also note that both of those changes should be useful for CASSANDRA-4180 too, to handle the end-of-row marker described in that issue.
Now I admit this patch is not a small one, but unit tests are passing and there are a few basic tests at https://github.com/pcmanus/cassandra-dtest/commits/3708_tests.
Lastly, I'll add that the support for this in CQL3 is minimal as of this patch. We only allow what is basically the equivalent of the 'delete a whole super column' behavior. But it would be simple to allow for more generic use of range tombstones, i.e. to allow statements like:
DELETE FROM test WHERE k=0 AND c > 3 AND c <= 10
But the patch is big enough that we can look at that later.