[CASSANDRA-33] Bugs in tombstone handling in remove code - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 0.3
Component/s: None
Labels: None
Severity: Normal

Description

[copied from dev list]

Avinash pointed out two bugs in my remove code. One is easy to fix,
the other is tougher.

The easy one is that my code removes tombstones (deletion markers) at
the ColumnFamilyStore level, so when CassandraServer does read repair
it will not know about the tombstones and they will not be replicated
correctly. This can be fixed by simply moving the removeDeleted call
up to just before CassandraServer's final return-to-client.
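
To illustrate the ordering issue, here is a minimal sketch in Python (not the actual Cassandra code). The `Column` type and the `resolve`, `repairs_needed`, `remove_deleted`, and `read` helpers are hypothetical stand-ins. The point is only that tombstones have to survive until read repair has been computed, and are filtered out as the last step before returning to the client.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Column:
    """Hypothetical column: a tombstone is a column with deleted=True."""
    value: Optional[bytes]
    timestamp: int
    deleted: bool = False


Row = Dict[str, Column]


def resolve(replica_rows: List[Row]) -> Row:
    """Merge replica responses, keeping the newest version of each column.

    Tombstones take part in the merge like any other column version.
    """
    merged: Row = {}
    for row in replica_rows:
        for name, col in row.items():
            if name not in merged or col.timestamp > merged[name].timestamp:
                merged[name] = col
    return merged


def repairs_needed(replica_rows: List[Row], merged: Row) -> List[Row]:
    """For each replica, the columns it lacks or holds a stale version of."""
    return [
        {name: col for name, col in merged.items() if row.get(name) != col}
        for row in replica_rows
    ]


def remove_deleted(row: Row) -> Row:
    """Drop tombstones so the client never sees deleted columns."""
    return {name: col for name, col in row.items() if not col.deleted}


def read(replica_rows: List[Row]) -> Tuple[Row, List[Row]]:
    """Compute read repair first, then strip tombstones just before returning."""
    merged = resolve(replica_rows)
    repairs = repairs_needed(replica_rows, merged)
    return remove_deleted(merged), repairs
```

If `remove_deleted` ran inside each replica's local read instead, `resolve` would never see the tombstone and the stale live column would win the merge.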

The hard one is that tombstones are problematic on GC (that is, major
compaction of SSTables, to use the Bigtable paper terminology).

One failure scenario: Nodes A, B, and C replicate some data. C goes
down. The data is deleted. A and B delete it and later GC it. C
comes back up. C now has the only copy of the data, so on read repair
the stale data will be sent to A and B.
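
The scenario above can be replayed with a toy model. This is a sketch under simplifying assumptions: one cell per replica, last-write-wins by timestamp, and read repair that copies the newest surviving version everywhere. The names `read_repair` and `simulate` are hypothetical.

```python
def read_repair(replicas):
    """Copy the newest surviving version of the cell to every replica."""
    versions = [cell for cell in replicas.values() if cell is not None]
    if not versions:
        return
    newest = max(versions, key=lambda cell: cell["timestamp"])
    for name in replicas:
        replicas[name] = dict(newest)


def simulate(gc_before_c_returns):
    """Replay the A/B/C scenario and return the final replica state."""
    live = {"value": "v1", "timestamp": 1, "deleted": False}
    replicas = {"A": dict(live), "B": dict(live), "C": dict(live)}

    # C is down, so the delete only reaches A and B as a tombstone.
    tombstone = {"value": None, "timestamp": 2, "deleted": True}
    replicas["A"] = dict(tombstone)
    replicas["B"] = dict(tombstone)

    if gc_before_c_returns:
        # Major compaction on A and B discards the tombstone entirely.
        replicas["A"] = None
        replicas["B"] = None

    # C comes back up and a read triggers read repair.
    read_repair(replicas)
    return replicas
```

With `simulate(True)` every replica ends up holding the live `v1` cell again, which is the bug. With `simulate(False)` the tombstone is still present on A and B, wins the comparison, and is copied to C.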

A solution: pick a number N such that we are confident that no node
will take longer than N days to come back up and catch up on hinted
handoffs. (Default value: 10?) Then, no node may GC tombstones before
N days have elapsed. Also, after N days, tombstones will no longer be
read repaired. (This prevents a node which has not yet GC'd from
sending a new tombstone copy to a node that has already GC'd.)
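
As a rough sketch of the two rules, assuming N = 10 days and a tombstone time recorded in seconds: the function names below are hypothetical, and treating "exactly N days old" as already expired is an assumption, since the proposal does not pin down the boundary.

```python
# Assumed default of N = 10 days, expressed in seconds.
GC_GRACE_IN_SECONDS = 10 * 24 * 60 * 60


def can_gc(tombstone_time, now, gc_grace=GC_GRACE_IN_SECONDS):
    """A tombstone may be dropped at compaction only once the grace period has elapsed."""
    return now - tombstone_time >= gc_grace


def should_read_repair(tombstone_time, now, gc_grace=GC_GRACE_IN_SECONDS):
    """Tombstones older than the grace period are no longer sent on read repair.

    This keeps a node that has not compacted yet from re-sending a tombstone
    to a node that has already dropped it.
    """
    return not can_gc(tombstone_time, now, gc_grace)
```

Using the same cutoff for both checks means a tombstone is either still protected and repairable, or eligible for GC and ignored by repair, never both.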

Implementation detail: we'll need to add a 32-bit "time of tombstone"
to ColumnFamily and SuperColumn. (For Column we can stick it in the
byte[] value, since we already have an unambiguous way to know if the
Column is in a deleted state.) We only need 32 bits since the time
frame here is coarse enough that we don't need millisecond resolution.
Also, we will use the system clock for these values, not the client
timestamp, since we don't know what the source of the client
timestamps is.
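
One way the Column case could look, sketched in Python: the deletion time in whole seconds, taken from the system clock, is packed into a 4-byte value. The big-endian signed layout (chosen to mirror a Java `int`) and the function names are assumptions, not something the description specifies.

```python
import struct
import time


def tombstone_value(now_seconds=None):
    """Encode the local deletion time (whole seconds) as a 4-byte column value.

    Uses the server's system clock by default, not the client timestamp.
    """
    if now_seconds is None:
        now_seconds = int(time.time())
    # ">i" = big-endian signed 32-bit integer (an assumed layout).
    return struct.pack(">i", now_seconds)


def tombstone_time(value):
    """Decode the deletion time from a deleted column's 4-byte value."""
    (seconds,) = struct.unpack(">i", value)
    return seconds
```

Second resolution is enough here because the value is only ever compared against a grace period measured in days.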

Admittedly this is suboptimal compared to being able to GC immediately,
but it has the virtues of (a) being easy to implement, (b) needing no
extra components such as a coordination protocol, and (c) being better
than not GCing tombstones at all (the other easy way to ensure
correctness).

Attachments

0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch (17/Apr/09 15:20, 18 kB, Jonathan Ellis)
0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch (17/Apr/09 15:20, 13 kB, Jonathan Ellis)
0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch (17/Apr/09 15:53, 5 kB, Jonathan Ellis)
0004_expose_remove_bug.patch (17/Apr/09 18:49, 2 kB, Jun Rao)
0004-and-5-v2.patch (17/Apr/09 19:10, 4 kB, Jonathan Ellis)
0005_fix_exposed_remove_bug.patch (17/Apr/09 18:50, 1 kB, Jun Rao)
0006_fix_sequencefile_bug.patch (18/Apr/09 01:11, 2 kB, Jun Rao)
0007_fix_another_sequencefile_bug.patch (20/Apr/09 16:27, 1 kB, Jun Rao)

Issue Links

blocks

CASSANDRA-34 Hinted handoff rows never get deleted (Resolved)
CASSANDRA-87 read repair of tombstones on columnfamilies and supercolumns (Resolved)
CASSANDRA-29 Change value to binary from string (Resolved)

People

Assignee: Jonathan Ellis
Reporter: Jonathan Ellis
Authors: Jonathan Ellis
Votes: 0
Watchers: 1

Dates

Created: 01/Apr/09 13:47
Updated: 16/Apr/19 09:33
Resolved: 20/Apr/09 16:39